Item loaders keep scraped data consistent by applying the same cleanup rules to every extracted field before export or storage. Centralized normalization reduces repeated parsing logic in spiders and makes item output easier to validate across different pages.
Scrapy's ItemLoader collects values from selectors and applies input and output processors to each field before returning the final item. Values added with add_css, add_xpath, or add_value flow through processors such as MapCompose for per-value cleanup and TakeFirst for collapsing a list to a single value.
Processors are order-sensitive and can silently change meaning when they drop empty values or discard additional matches. Treat multi-value fields explicitly, keep transformations small and predictable, and validate selectors in scrapy shell when a field returns blank or unexpected results.
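Outside a spider, these processors are plain callables, so their behavior is easy to check before wiring them into a loader. The snippet below is a minimal sketch of that behavior; the sample strings are illustrative only.

from itemloaders.processors import MapCompose, TakeFirst

clean = MapCompose(str.strip)   # applied to each value in the list
first = TakeFirst()             # returns the first non-empty value

values = clean(["  \n", "  Starter Plan  "])
print(values)         # ['', 'Starter Plan']
print(first(values))  # 'Starter Plan' -- the empty string is silently skipped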
Related: How to use Scrapy shell
Related: How to use CSS selectors in Scrapy
Steps to use Item Loaders in Scrapy:
- Open the project's item definitions file.
$ vi catalog_demo/items.py
- Define an item class with fields matching the target data.
import scrapy


class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    url = scrapy.Field()
- Create a loader class with default processors for the item.
$ vi catalog_demo/loaders.py
import re

from itemloaders import ItemLoader
from itemloaders.processors import MapCompose, TakeFirst


def parse_price(value: str) -> str:
    cleaned = re.sub(r"[^\d.]", "", value)
    return cleaned


class ProductLoader(ItemLoader):
    default_input_processor = MapCompose(str.strip)
    default_output_processor = TakeFirst()
    price_in = MapCompose(str.strip, parse_price)
default_input_processor runs for every field unless a field-specific processor (<field>_in) is set.
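Output processors follow the same naming pattern with an _out suffix. As a hedged illustration, the sketch below adds a hypothetical multi-value tags field (not part of ProductItem above) and keeps every match as a list by overriding the default TakeFirst with Identity; the matching item class would also need a tags = scrapy.Field() declaration.

from itemloaders.processors import Identity, MapCompose

class TaggedProductLoader(ProductLoader):
    tags_in = MapCompose(str.strip, str.lower)  # per-value cleanup
    tags_out = Identity()                       # keep all matches instead of collapsing to one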
- Update the spider to populate items through the loader.
$ vi catalog_demo/spiders/catalog_loader.py
import scrapy
from itemloaders.processors import MapCompose

from catalog_demo.items import ProductItem
from catalog_demo.loaders import ProductLoader


class CatalogLoaderSpider(scrapy.Spider):
    name = "catalog_loader"
    allowed_domains = ["app.internal.example"]
    start_urls = ["http://app.internal.example:8000/products/"]

    def parse(self, response):
        for product in response.css("article.product"):
            loader = ProductLoader(item=ProductItem(), selector=product)
            loader.add_css("name", "h2::text")
            loader.add_css("price", ".price::text")
            loader.add_css("url", "a.detail::attr(href)", MapCompose(response.urljoin))
            yield loader.load_item()
Passing response.urljoin into MapCompose converts relative links to absolute URLs.
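Response.urljoin resolves the link against the response URL via urllib.parse.urljoin, so the conversion can be reproduced outside the spider. The snippet below assumes a relative href such as "starter-plan.html" on the listing page.

from urllib.parse import urljoin

base = "http://app.internal.example:8000/products/"
print(urljoin(base, "starter-plan.html"))
# http://app.internal.example:8000/products/starter-plan.html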
- Run the spider with feed export enabled to write items to a JSON file.
$ scrapy crawl catalog_loader -O catalog.json
2026-01-01 09:38:49 [scrapy.extensions.feedexport] INFO: Stored json feed (6 items) in: catalog.json
- Inspect the exported JSON for cleaned fields with expected types.
$ head -n 6 catalog.json
[
{"name": "Starter Plan", "price": "29", "url": "http://app.internal.example:8000/products/starter-plan.html"},
{"name": "Team Plan", "price": "79", "url": "http://app.internal.example:8000/products/team-plan.html"},
{"name": "Enterprise Plan", "price": "199", "url": "http://app.internal.example:8000/products/enterprise-plan.html"},
{"name": "Growth Plan", "price": "129", "url": "http://app.internal.example:8000/products/growth-plan.html"},
{"name": "Agency Plan", "price": "249", "url": "http://app.internal.example:8000/products/agency-plan.html"},
##### snipped #####
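Beyond eyeballing the output, a short script can assert that every exported item has the expected shape. This is a minimal sketch that assumes catalog.json sits in the current directory.

import json

with open("catalog.json") as fh:
    items = json.load(fh)

for item in items:
    assert item["name"].strip()                         # non-empty name
    assert item["price"].replace(".", "", 1).isdigit()  # numeric price string
    assert item["url"].startswith("http")               # absolute URL

print(f"checked {len(items)} items")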
