Item loaders keep field cleanup and shaping in one place so spiders can stay focused on finding data. Moving whitespace stripping, price normalization, and similar cleanup into a loader makes exported items easier to compare across pages and easier to reuse across multiple spiders.
Scrapy's ItemLoader collects values from add_css(), add_xpath(), or add_value(), stores them internally as lists, and applies output processors when load_item() is called. A loader class can define project-wide defaults such as MapCompose(str.strip) and then override a single field with a more specific processor such as price_in when one field needs extra cleanup.
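The collect-then-finalize cycle is easiest to see outside a crawl with add_value(). The sketch below is illustrative only; DemoItem and DemoLoader are throwaway names, not part of the project built in the steps that follow.

import scrapy
from itemloaders.processors import MapCompose, TakeFirst
from scrapy.loader import ItemLoader


class DemoItem(scrapy.Item):
    name = scrapy.Field()


class DemoLoader(ItemLoader):
    default_input_processor = MapCompose(str.strip)   # runs on each value as it is added
    default_output_processor = TakeFirst()            # runs once, inside load_item()


loader = DemoLoader(item=DemoItem())
loader.add_value("name", "  Starter Plan  ")
loader.add_value("name", "duplicate heading")
print(loader.get_collected_values("name"))  # ['Starter Plan', 'duplicate heading'] - kept as a list
print(loader.load_item())                   # {'name': 'Starter Plan'} - TakeFirst() collapses the list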
Item loaders do not fix bad selectors or missing item fields on their own. Keep processors small and predictable, define the target fields first, and verify the exported result after wiring the loader into the spider so a field-specific processor does not quietly drop or reshape values in the wrong way.
Related: How to use Scrapy shell
Related: How to use CSS selectors in Scrapy
Steps to use Item Loaders in Scrapy:
- Open the project's item definitions file and declare the fields the loader will populate.
$ vi catalog_demo/items.py
import scrapy


class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    url = scrapy.Field()
Related: How to define item fields in Scrapy
- Create a loader class with default cleanup rules and a field-specific processor for the price value.
$ vi catalog_demo/loaders.py
import re

from itemloaders.processors import MapCompose, TakeFirst
from scrapy.loader import ItemLoader


def parse_price(value: str) -> str:
    return re.sub(r"[^\d.]", "", value)


class ProductLoader(ItemLoader):
    default_input_processor = MapCompose(str.strip)
    default_output_processor = TakeFirst()
    price_in = MapCompose(str.strip, parse_price)
price_in takes precedence over default_input_processor for the price field, so the currency cleanup stays isolated to that field.
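A quick interactive check, run with python from the project root so the catalog_demo package imports, shows the override without starting a crawl; the sample price string here is made up for illustration.

from catalog_demo.items import ProductItem
from catalog_demo.loaders import ProductLoader

loader = ProductLoader(item=ProductItem())
loader.add_value("name", "  Starter Plan  ")   # cleaned by default_input_processor only
loader.add_value("price", "  $29.00 USD  ")    # cleaned by price_in: strip, then drop non-numeric characters
print(loader.load_item())                      # {'name': 'Starter Plan', 'price': '29.00'}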
- Update the spider to create one loader per product selector and add each extracted value to it.
$ vi catalog_demo/spiders/catalog_loader.py
import scrapy
from itemloaders.processors import MapCompose

from catalog_demo.items import ProductItem
from catalog_demo.loaders import ProductLoader


class CatalogLoaderSpider(scrapy.Spider):
    name = "catalog_loader"
    allowed_domains = ["catalog.internal.example"]
    start_urls = ["http://catalog.internal.example/"]

    def parse(self, response):
        for product in response.css("article.product"):
            loader = ProductLoader(item=ProductItem(), selector=product)
            loader.add_css("name", "h2::text")
            loader.add_css("price", ".price::text")
            loader.add_css("url", "a.detail::attr(href)", MapCompose(response.urljoin))
            yield loader.load_item()
Passing selector=product keeps each add_css() call scoped to one product block instead of the whole response.
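The same scoping can be verified in the Scrapy shell before running the full crawl; the session below is a sketch that assumes the shell is started inside the project and that the page markup matches the selectors used in the spider, with expected output taken from the sample catalog data shown later in this guide.

$ scrapy shell http://catalog.internal.example/
##### snipped #####
>>> from catalog_demo.items import ProductItem
>>> from catalog_demo.loaders import ProductLoader
>>> product = response.css("article.product")[0]
>>> loader = ProductLoader(item=ProductItem(), selector=product)
>>> loader.add_css("name", "h2::text")     # matches only inside this product block
>>> loader.add_css("price", ".price::text")
>>> loader.load_item()
{'name': 'Starter Plan', 'price': '29'}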
- Run the spider and export the normalized items to a JSON file.
$ scrapy crawl catalog_loader -O products.json
##### snipped #####
2026-04-16 05:46:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://catalog.internal.example/>
{'name': 'Starter Plan', 'price': '29', 'url': 'http://catalog.internal.example/products/starter-plan.html'}
2026-04-16 05:46:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://catalog.internal.example/>
{'name': 'Team Plan', 'price': '79', 'url': 'http://catalog.internal.example/products/team-plan.html'}
2026-04-16 05:46:18 [scrapy.extensions.feedexport] INFO: Stored json feed (2 items) in: products.json
The crawl log shows the loader cleaned the price field before the feed export wrote each item.
- Open the exported file and confirm each field contains the cleaned final value.
$ cat products.json
[
{"name": "Starter Plan", "price": "29", "url": "http://catalog.internal.example/products/starter-plan.html"},
{"name": "Team Plan", "price": "79", "url": "http://catalog.internal.example/products/team-plan.html"}
]
TakeFirst() suits single-value fields such as name or price; keep a list-based output processor for tags, categories, or repeated links that should not collapse to one value, as sketched below.
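When a field genuinely holds multiple values, a field-specific output processor keeps the list; the tags field below is hypothetical and only illustrates the pattern, so it would also need to be declared alongside the project's real item fields before use.

import scrapy
from itemloaders.processors import Identity, MapCompose, TakeFirst
from scrapy.loader import ItemLoader


class TaggedProductItem(scrapy.Item):
    name = scrapy.Field()
    tags = scrapy.Field()


class TaggedProductLoader(ItemLoader):
    default_input_processor = MapCompose(str.strip)
    default_output_processor = TakeFirst()
    tags_out = Identity()   # field-specific output processor: keep the whole list


loader = TaggedProductLoader(item=TaggedProductItem())
loader.add_value("name", "  Starter Plan  ")
loader.add_value("tags", ["cloud", " starter "])
print(loader.load_item())   # {'name': 'Starter Plan', 'tags': ['cloud', 'starter']}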
Mohd Shakir Zakaria is a cloud architect with deep roots in software development and open-source advocacy. Certified in AWS, Red Hat, VMware, ITIL, and Linux, he specializes in designing and managing robust cloud and on-premises infrastructures.
