Item pipelines make scraped data usable by applying consistent cleaning, validation, and persistence after extraction. Enabling a pipeline is a practical way to trim whitespace, normalize types, drop incomplete records, or store items externally without turning spider code into a maintenance trap.
In Scrapy, each item yielded by a spider is passed through the pipeline chain defined in ITEM_PIPELINES. A pipeline is a Python class implementing process_item, and Scrapy executes enabled pipelines in priority order (lower numbers run earlier) before items are exported to feeds or processed by downstream components.
Ordering matters when one pipeline depends on changes made by another, and failures are not silent: an unhandled exception is logged as an error and the affected item is discarded, while intentional filtering requires raising DropItem. Keep pipeline logic focused and deterministic, and validate changes with short test runs before long crawls to avoid silently damaging output.
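For intentional filtering, raise DropItem instead of letting an exception escape. A minimal sketch, assuming a hypothetical RequireNamePipeline that discards records missing a name:

from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem

class RequireNamePipeline:
    def process_item(self, item, spider):
        # Discard the item entirely if it has no usable name field.
        if not ItemAdapter(item).get("name"):
            raise DropItem("missing name")
        return item

Dropped items are logged by Scrapy and never reach feed exports or later pipelines.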
Related: How to export Scrapy items to JSON
Related: How to download files with Scrapy
Steps to enable item pipelines in Scrapy:
- Open the project's pipelines file.
$ vi catalog_demo/pipelines.py
Replace catalog_demo with the Scrapy project package name.
- Add a pipeline class to normalize fields.
from itemadapter import ItemAdapter


class CleanNamePipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        name = adapter.get("name")
        if name:
            adapter["name"] = str(name).strip()
        return item
ItemAdapter supports both Scrapy Item objects and plain dict items.
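Because a plain dict works too, the pipeline can be exercised in a quick interactive check before a crawl (a sketch; the padded value below is made up):

pipeline = CleanNamePipeline()
cleaned = pipeline.process_item({"name": "  Starter Plan  "}, spider=None)
print(cleaned)  # {'name': 'Starter Plan'}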
- Enable the pipeline in settings.py with a priority.
ITEM_PIPELINES = {
    "catalog_demo.pipelines.CleanNamePipeline": 300,
}
Lower priority numbers run earlier; add more entries to chain multiple pipelines, as sketched after this step.
An incorrect module path in ITEM_PIPELINES can prevent the spider from starting due to an ImportError.
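To chain pipelines, add more entries to the same dict with distinct priorities; lower numbers run first, so cleaning happens before validation. A sketch that also enables the hypothetical RequireNamePipeline from the introduction (not used in the run below):

ITEM_PIPELINES = {
    "catalog_demo.pipelines.CleanNamePipeline": 300,    # runs first
    "catalog_demo.pipelines.RequireNamePipeline": 400,  # sees cleaned names
}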
- Run the spider with JSON export enabled.
$ scrapy crawl catalog -O products.json
2026-01-01 09:38:57 [scrapy.middleware] INFO: Enabled item pipelines:
['catalog_demo.pipelines.CleanNamePipeline']
Add -s CLOSESPIDER_ITEMCOUNT=10 for quick pipeline testing on a small crawl.
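For example, the following assumed command stops the crawl after roughly ten items and writes them to a separate sample file:

$ scrapy crawl catalog -O sample.json -s CLOSESPIDER_ITEMCOUNT=10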
- Inspect the output to confirm cleaned fields.
$ head -n 4 products.json
[
{"name": "Starter Plan", "price": "$29", "url": "http://app.internal.example:8000/products/starter-plan.html"},
{"name": "Team Plan", "price": "$79", "url": "http://app.internal.example:8000/products/team-plan.html"},
{"name": "Enterprise Plan", "price": "$199", "url": "http://app.internal.example:8000/products/enterprise-plan.html"},
##### snipped #####
Mohd Shakir Zakaria is a cloud architect with deep roots in software development and open-source advocacy. Certified in AWS, Red Hat, VMware, ITIL, and Linux, he specializes in designing and managing robust cloud and on-premises infrastructures.
