How to use Item Loaders in Scrapy

Item loaders keep field cleanup close to the item schema so a spider callback can stay focused on selectors and link traversal. That becomes useful as soon as one field needs repeatable trimming, prefix cleanup, or URL normalization across more than one response.

In current Scrapy, ItemLoader collects every value added through add_css(), add_xpath(), or add_value() into a per-field list, then applies each field's output processor when load_item() builds the finished item. Scrapy's loader class extends the ItemLoader class from the underlying itemloaders library, so the supported import path is from scrapy.loader import ItemLoader, while processors such as MapCompose and TakeFirst come from itemloaders.processors.

Loaders do not fix weak selectors or missing fields on their own. Keep processors small, use field-specific cleanup only where the field genuinely needs it, and confirm the resulting item on one real response before moving the loader into a wider crawl.

Steps to use Item Loaders in Scrapy:

  1. Create a one-file spider that defines the item, the loader, and one loader-driven callback against Scrapy's sample page.
    loader_spider.py
    import scrapy
    from itemloaders.processors import MapCompose, TakeFirst
    from scrapy.loader import ItemLoader
     
     
    def normalize_label(value: str) -> str:
        # str.removeprefix requires Python 3.9 or later
        return value.removeprefix("Name:").strip()
     
     
    class ImageLinkItem(scrapy.Item):
        label = scrapy.Field()
        href = scrapy.Field()
     
     
    class ImageLinkLoader(ItemLoader):
        default_input_processor = MapCompose(str.strip)
        default_output_processor = TakeFirst()
        label_in = MapCompose(str.strip, normalize_label)
     
     
    class LoaderSpider(scrapy.Spider):
        name = "loader"
        custom_settings = {
            "ROBOTSTXT_OBEY": False,
        }
        start_urls = [
            (
                "https://docs.scrapy.org/en/latest/_static/"
                "selectors-sample1.html"
            ),
        ]
     
        def parse(self, response):
            for link in response.css("#images a"):
                loader = ImageLinkLoader(
                    item=ImageLinkItem(),
                    selector=link,
                )
                loader.add_css("label", "::text")
                loader.add_css(
                    "href",
                    "::attr(href)",
                    MapCompose(response.urljoin),
                )
                yield loader.load_item()

    label_in overrides the default input processor only for the label field, while the per-call MapCompose(response.urljoin) turns each extracted href, relative in the page markup, into an absolute URL. Related: How to define item fields in Scrapy

  2. Run the spider and write the normalized items to a JSON file.
    $ scrapy runspider loader_spider.py -O image-links.json
    ##### snipped #####
    2026-04-22 06:44:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://docs.scrapy.org/en/latest/_static/selectors-sample1.html>
    {'href': 'http://example.com/image1.html', 'label': 'My image 1'}
    2026-04-22 06:44:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://docs.scrapy.org/en/latest/_static/selectors-sample1.html>
    {'href': 'http://example.com/image2.html', 'label': 'My image 2'}
    ##### snipped #####
    2026-04-22 06:44:52 [scrapy.extensions.feedexport] INFO: Stored json feed (5 items) in: image-links.json

    scrapy runspider is useful for a quick loader test because the Item, loader, and spider can stay in one file until the selectors and processors are stable.

  3. Open the export and confirm the label field lost the Name: prefix and the href field came out as an absolute URL.
    $ cat image-links.json
    [
    {"label": "My image 1", "href": "http://example.com/image1.html"},
    {"label": "My image 2", "href": "http://example.com/image2.html"},
    {"label": "My image 3", "href": "http://example.com/image3.html"},
    {"label": "My image 4", "href": "http://example.com/image4.html"},
    {"label": "My image 5", "href": "http://example.com/image5.html"}
    ]

    TakeFirst() suits single-value fields such as label or href, but repeated fields such as tags or multiple links should keep a list-oriented output processor instead of collapsing to one value.