Item loaders keep scraped data consistent by applying the same cleanup rules to every extracted field before export or storage. Centralized normalization reduces repeated parsing logic in spiders and makes item output easier to validate across different pages.
Scrapy's ItemLoader collects values from selectors and applies input and output processors to each field before returning the final item. Values added with add_css, add_xpath, or add_value flow through processors such as MapCompose for per-value cleanup and TakeFirst for collapsing a list to a single value.
Processors are order-sensitive and can silently change meaning when they drop empty values or discard additional matches. Treat multi-value fields explicitly, keep transformations small and predictable, and validate selectors in scrapy shell when a field returns blank or unexpected results.
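Outside a spider, these processors are plain callables, so their behavior is easy to check before wiring them into a loader. The snippet below is a minimal sketch of that behavior; the sample strings are illustrative only.

from itemloaders.processors import MapCompose, TakeFirst

clean = MapCompose(str.strip)   # applied to each value in the list
first = TakeFirst()             # returns the first non-empty value

values = clean(["  \n", "  Starter Plan  "])
print(values)         # ['', 'Starter Plan']
print(first(values))  # 'Starter Plan' -- the empty string is silently skipped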
Related: How to use Scrapy shell
Related: How to use CSS selectors in Scrapy
Steps to use Item Loaders in Scrapy:
- Open the project's item definitions file.
$ vi catalog_demo/items.py
- Define an item class with fields matching the target data.
import scrapy


class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    url = scrapy.Field()
- Create a loader class with default processors for the item.
$ vi catalog_demo/loaders.py
import re

from itemloaders import ItemLoader
from itemloaders.processors import MapCompose, TakeFirst


def parse_price(value: str) -> str:
    cleaned = re.sub(r"[^\d.]", "", value)
    return cleaned


class ProductLoader(ItemLoader):
    default_input_processor = MapCompose(str.strip)
    default_output_processor = TakeFirst()
    price_in = MapCompose(str.strip, parse_price)
default_input_processor runs for every field unless a field-specific processor (<field>_in) is set.
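Output processors follow the same naming pattern with an _out suffix. As a hedged illustration, the sketch below adds a hypothetical multi-value tags field (not part of ProductItem above) and keeps every match as a list by overriding the default TakeFirst with Identity; the matching item class would also need a tags = scrapy.Field() declaration.

from itemloaders.processors import Identity, MapCompose

class TaggedProductLoader(ProductLoader):
    tags_in = MapCompose(str.strip, str.lower)  # per-value cleanup
    tags_out = Identity()                       # keep all matches instead of collapsing to one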
- Update the spider to populate items through the loader.
$ vi catalog_demo/spiders/catalog_loader.py
import scrapy
from itemloaders.processors import MapCompose

from catalog_demo.items import ProductItem
from catalog_demo.loaders import ProductLoader


class CatalogLoaderSpider(scrapy.Spider):
    name = "catalog_loader"
    allowed_domains = ["app.internal.example"]
    start_urls = ["http://app.internal.example:8000/products/"]

    def parse(self, response):
        for product in response.css("article.product"):
            loader = ProductLoader(item=ProductItem(), selector=product)
            loader.add_css("name", "h2::text")
            loader.add_css("price", ".price::text")
            loader.add_css("url", "a.detail::attr(href)", MapCompose(response.urljoin))
            yield loader.load_item()
Passing response.urljoin into MapCompose converts relative links to absolute URLs.
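Response.urljoin resolves the link against the response URL via urllib.parse.urljoin, so the conversion can be reproduced outside the spider. The snippet below assumes a relative href such as "starter-plan.html" on the listing page.

from urllib.parse import urljoin

base = "http://app.internal.example:8000/products/"
print(urljoin(base, "starter-plan.html"))
# http://app.internal.example:8000/products/starter-plan.html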
- Run the spider with feed export enabled to write items to a JSON file.
$ scrapy crawl catalog_loader -O catalog.json
2026-01-01 09:38:49 [scrapy.extensions.feedexport] INFO: Stored json feed (6 items) in: catalog.json
- Inspect the exported JSON for cleaned fields with expected types.
$ head -n 6 catalog.json
[
{"name": "Starter Plan", "price": "29", "url": "http://app.internal.example:8000/products/starter-plan.html"},
{"name": "Team Plan", "price": "79", "url": "http://app.internal.example:8000/products/team-plan.html"},
{"name": "Enterprise Plan", "price": "199", "url": "http://app.internal.example:8000/products/enterprise-plan.html"},
{"name": "Growth Plan", "price": "129", "url": "http://app.internal.example:8000/products/growth-plan.html"},
{"name": "Agency Plan", "price": "249", "url": "http://app.internal.example:8000/products/agency-plan.html"},
##### snipped #####
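Beyond eyeballing the output, a short script can assert that every exported item has the expected shape. This is a minimal sketch that assumes catalog.json sits in the current directory.

import json

with open("catalog.json") as fh:
    items = json.load(fh)

for item in items:
    assert item["name"].strip()                         # non-empty name
    assert item["price"].replace(".", "", 1).isdigit()  # numeric price string
    assert item["url"].startswith("http")               # absolute URL

print(f"checked {len(items)} items")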
