Item loaders keep scraped data consistent by applying the same cleanup rules to every extracted field before export or storage. Centralized normalization reduces repeated parsing logic in spiders and makes item output easier to validate across different pages.

Scrapy's ItemLoader collects values from selectors and applies input and output processors to each field before returning the final item. Values added with add_css, add_xpath, or add_value flow through processors such as MapCompose for per-value cleanup and TakeFirst for collapsing a list to a single value.
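
The flow is easy to see with the itemloaders package on its own, outside any spider. A minimal sketch (the TitleLoader class and its values are made up for illustration):

    from itemloaders import ItemLoader
    from itemloaders.processors import MapCompose, TakeFirst

    class TitleLoader(ItemLoader):
        title_in = MapCompose(str.strip)   # input processor: runs on each value as it is added
        title_out = TakeFirst()            # output processor: runs when load_item() builds the item

    loader = TitleLoader()
    loader.add_value("title", ["  Widget  ", "Widget Pro"])
    print(loader.load_item())  # {'title': 'Widget'}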

Processors are order-sensitive and can silently change meaning when they drop empty values or discard additional matches. Treat multi-value fields explicitly, keep transformations small and predictable, and validate selectors in scrapy shell when a field returns blank or unexpected results.
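
Two behaviors behind that warning are worth seeing directly: MapCompose drops any value for which a function returns None, and TakeFirst skips empty values, so reordering functions can silently change what survives. A standalone sketch, not part of the project below:

    from itemloaders.processors import MapCompose, TakeFirst

    def drop_blank(value):
        # Returning None makes MapCompose discard the value entirely.
        return value or None

    strip_then_drop = MapCompose(str.strip, drop_blank)
    drop_then_strip = MapCompose(drop_blank, str.strip)

    values = ["   ", " 29 "]
    print(strip_then_drop(values))        # ['29']      -- whitespace-only value removed
    print(drop_then_strip(values))        # ['', '29']  -- empty string slips through
    print(TakeFirst()(["", None, "29"]))  # '29'        -- skips empty and None values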

Steps to use Item Loaders in Scrapy:

  1. Open the project's item definitions file.
    $ vi catalog_demo/items.py
  2. Define an item class with fields matching the target data.
    import scrapy
     
    class ProductItem(scrapy.Item):
        name = scrapy.Field()
        price = scrapy.Field()
        url = scrapy.Field()
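
    A declared item behaves like a dict but rejects fields that were not declared, which catches typos early. A quick interactive check (the values are made up for illustration):

    from catalog_demo.items import ProductItem

    item = ProductItem(name="Starter Plan", price="29")
    item["url"] = "http://app.internal.example:8000/products/starter-plan.html"
    # item["sku"] = "X1"  # raises KeyError: ProductItem does not support field: sku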
  3. Create a loader class with default processors for the item.
    $ vi catalog_demo/loaders.py
    import re
     
    from itemloaders import ItemLoader
    from itemloaders.processors import MapCompose, TakeFirst
     
    def parse_price(value: str) -> str:
        # Keep only digits and the decimal separator, e.g. "$29.00 / mo" -> "29.00".
        cleaned = re.sub(r"[^\d.]", "", value)
        return cleaned
     
    class ProductLoader(ItemLoader):
        default_input_processor = MapCompose(str.strip)
        default_output_processor = TakeFirst()
        price_in = MapCompose(str.strip, parse_price)

    default_input_processor runs for every field unless a field-specific processor (<field>_in) is set.
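
    The same pattern applies on the output side: a <field>_out attribute overrides default_output_processor. A quick way to see the field-specific price processor at work, assuming the classes above (the input strings are made up):

    from catalog_demo.items import ProductItem
    from catalog_demo.loaders import ProductLoader

    loader = ProductLoader(item=ProductItem())
    loader.add_value("name", "  Starter Plan  ")    # default processors only
    loader.add_value("price", " $29.00 / month ")   # price_in strips the currency text
    print(loader.load_item())  # {'name': 'Starter Plan', 'price': '29.00'}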

  4. Update the spider to populate items through the loader.
    $ vi catalog_demo/spiders/catalog_loader.py
    import scrapy
    from itemloaders.processors import MapCompose
     
    from catalog_demo.items import ProductItem
    from catalog_demo.loaders import ProductLoader
     
    class CatalogLoaderSpider(scrapy.Spider):
        name = "catalog_loader"
        allowed_domains = ["app.internal.example"]
        start_urls = ["http://app.internal.example:8000/products/"]
     
        def parse(self, response):
            for product in response.css("article.product"):
                loader = ProductLoader(item=ProductItem(), selector=product)
                loader.add_css("name", "h2::text")
                loader.add_css("price", ".price::text")
                loader.add_css("url", "a.detail::attr(href)", MapCompose(response.urljoin))
                yield loader.load_item()

    Passing response.urljoin into MapCompose converts relative links to absolute URLs.
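
    If the inline processor feels too implicit, the same result can be produced by resolving the link first and adding the absolute value directly (an equivalent sketch using the same names as the spider above):

    href = product.css("a.detail::attr(href)").get()
    if href:
        loader.add_value("url", response.urljoin(href))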

  5. Run the spider with feed export enabled to write items to a JSON file.
    $ scrapy crawl catalog_loader -O catalog.json
    2026-01-01 09:38:49 [scrapy.extensions.feedexport] INFO: Stored json feed (6 items) in: catalog.json
  6. Inspect the exported JSON to confirm each field was cleaned and normalized as expected.
    $ head -n 6 catalog.json
    [
    {"name": "Starter Plan", "price": "29", "url": "http://app.internal.example:8000/products/starter-plan.html"},
    {"name": "Team Plan", "price": "79", "url": "http://app.internal.example:8000/products/team-plan.html"},
    {"name": "Enterprise Plan", "price": "199", "url": "http://app.internal.example:8000/products/enterprise-plan.html"},
    {"name": "Growth Plan", "price": "129", "url": "http://app.internal.example:8000/products/growth-plan.html"},
    {"name": "Agency Plan", "price": "249", "url": "http://app.internal.example:8000/products/agency-plan.html"},
    ##### snipped #####
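
    A short script can automate that check. A minimal sketch, assuming the catalog.json produced above and the three fields defined in ProductItem:

    import json
    from urllib.parse import urlparse

    with open("catalog.json") as fh:
        items = json.load(fh)

    for item in items:
        assert item["name"].strip(), item          # non-empty name
        float(item["price"])                       # price parses as a number
        assert urlparse(item["url"]).scheme, item  # URL is absolute
    print(f"validated {len(items)} items")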