Item loaders keep scraped data consistent by applying the same cleanup rules to every extracted field before export or storage. Centralized normalization reduces repeated parsing logic in spiders and makes item output easier to validate across different pages.
Scrapy's ItemLoader collects values from selectors and applies input and output processors to each field before returning the final item. Values added with add_css, add_xpath, or add_value flow through processors such as MapCompose for per-value cleanup and TakeFirst for collapsing a list to a single value.
Processors are order-sensitive and can silently change meaning when they drop empty values or discard additional matches. Treat multi-value fields explicitly, keep transformations small and predictable, and validate selectors in scrapy shell when a field returns blank or unexpected results.
Related: How to use Scrapy shell
Related: How to use CSS selectors in Scrapy
$ vi catalog_demo/items.py
import scrapy


class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    url = scrapy.Field()
$ vi catalog_demo/loaders.py
import re

from itemloaders import ItemLoader
from itemloaders.processors import MapCompose, TakeFirst


def parse_price(value: str) -> str:
    # Keep only digits and the decimal point, e.g. "$29.00/mo" -> "29.00"
    return re.sub(r"[^\d.]", "", value)


class ProductLoader(ItemLoader):
    default_input_processor = MapCompose(str.strip)
    default_output_processor = TakeFirst()

    # Field-specific input processor: overrides the default for "price"
    price_in = MapCompose(str.strip, parse_price)
default_input_processor runs for every field unless a field-specific processor (<field>_in) is set.
$ vi catalog_demo/spiders/catalog_loader.py
import scrapy
from itemloaders.processors import MapCompose

from catalog_demo.items import ProductItem
from catalog_demo.loaders import ProductLoader


class CatalogLoaderSpider(scrapy.Spider):
    name = "catalog_loader"
    allowed_domains = ["app.internal.example"]
    start_urls = ["http://app.internal.example:8000/products/"]

    def parse(self, response):
        for product in response.css("article.product"):
            # Selector-scoped loader: CSS queries run relative to each card
            loader = ProductLoader(item=ProductItem(), selector=product)
            loader.add_css("name", "h2::text")
            loader.add_css("price", ".price::text")
            loader.add_css("url", "a.detail::attr(href)", MapCompose(response.urljoin))
            yield loader.load_item()
Passing response.urljoin into MapCompose converts relative links to absolute URLs.
$ scrapy crawl catalog_loader -O catalog.json
2026-01-01 09:38:49 [scrapy.extensions.feedexport] INFO: Stored json feed (6 items) in: catalog.json
$ head -n 6 catalog.json
[
{"name": "Starter Plan", "price": "29", "url": "http://app.internal.example:8000/products/starter-plan.html"},
{"name": "Team Plan", "price": "79", "url": "http://app.internal.example:8000/products/team-plan.html"},
{"name": "Enterprise Plan", "price": "199", "url": "http://app.internal.example:8000/products/enterprise-plan.html"},
{"name": "Growth Plan", "price": "129", "url": "http://app.internal.example:8000/products/growth-plan.html"},
{"name": "Agency Plan", "price": "249", "url": "http://app.internal.example:8000/products/agency-plan.html"},
##### snipped #####