Item pipelines let a Scrapy project clean, validate, enrich, or drop items after a spider yields them and before feed exports or storage backends write the result. Moving that logic out of spider callbacks keeps the extraction code focused on collecting values while one reusable pipeline class handles the post-processing rules.
Scrapy enables pipelines through the ITEM_PIPELINES setting in settings.py. Each active class runs in sequence from the lowest integer value to the highest, and a basic pipeline implements process_item(self, item, spider) to update the item or raise DropItem when the record should go no further.
Current scrapy startproject output still leaves the ITEM_PIPELINES block commented out in settings.py, and the generated pipelines.py stub already uses the documented process_item(self, item, spider) signature but simply passes every item through unchanged. Replace the stub with real processing logic, enable the setting, and then confirm the crawl log shows the pipeline under Enabled item pipelines.
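For reference, the stub that scrapy startproject writes into pipelines.py for a project named catalogdemo looks roughly like this (the generated comments vary slightly between Scrapy versions):

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
from itemadapter import ItemAdapter


class CatalogdemoPipeline:
    def process_item(self, item, spider):
        return item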
Related: How to export Scrapy items to JSON
Related: How to download files with Scrapy
Steps to enable item pipelines in Scrapy:
- Open the project's pipeline module.
$ vi catalogdemo/pipelines.py
Replace catalogdemo with the Scrapy project package name that sits next to scrapy.cfg.
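A project created with scrapy startproject catalogdemo typically has a layout like the following, with pipelines.py inside the package directory that sits alongside scrapy.cfg:

$ ls
catalogdemo  scrapy.cfg
$ ls catalogdemo
__init__.py  items.py  middlewares.py  pipelines.py  settings.py  spiders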
- Replace the generated stub with a pipeline class that trims surrounding spaces and drops empty names.
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem


class CleanNamePipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        cleaned_name = str(adapter.get("name", "")).strip()
        if not cleaned_name:
            raise DropItem("Missing name")
        adapter["name"] = cleaned_name
        return item
ItemAdapter keeps the same pipeline working with Scrapy Item objects and plain Python dict items.
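As a minimal sketch of that flexibility, assuming a hypothetical ProductItem declaration, both a declared Item and a plain dict expose the same mapping interface once wrapped in ItemAdapter:

import scrapy
from itemadapter import ItemAdapter


class ProductItem(scrapy.Item):
    # Hypothetical Item declaration; a plain dict with the same key works too.
    name = scrapy.Field()


# Both objects can be read and updated through the same adapter interface,
# so CleanNamePipeline does not care which kind the spider yields.
for obj in (ProductItem(name=" Starter Plan "), {"name": " Starter Plan "}):
    adapter = ItemAdapter(obj)
    adapter["name"] = adapter.get("name").strip()
    print(adapter.asdict())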
- Uncomment or add the ITEM_PIPELINES setting in settings.py so Scrapy can load the class.
ITEM_PIPELINES = {
    "catalogdemo.pipelines.CleanNamePipeline": 300,
}
Lower numbers run earlier, so 300 is a common place for a first cleanup or validation stage.
If the dotted path does not match the real module and class name, Scrapy cannot import the pipeline and the crawl fails during startup, before any items are scraped.
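When more stages are added later, each class gets its own entry and runs in priority order. As a sketch, a hypothetical DuplicateNamePipeline registered at 400 would only see the items that CleanNamePipeline returned at 300:

from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem


class DuplicateNamePipeline:
    # Hypothetical second stage that drops repeated product names.
    def __init__(self):
        self.seen_names = set()

    def process_item(self, item, spider):
        name = ItemAdapter(item).get("name")
        if name in self.seen_names:
            raise DropItem(f"Duplicate name: {name}")
        self.seen_names.add(name)
        return item

Both classes would then be listed together in settings.py:

ITEM_PIPELINES = {
    "catalogdemo.pipelines.CleanNamePipeline": 300,
    "catalogdemo.pipelines.DuplicateNamePipeline": 400,
}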
- Run a short JSON Lines export and confirm the crawl log lists the pipeline as enabled.
$ scrapy crawl catalog -O products.jl
##### snipped #####
2026-04-22 05:51:14 [scrapy.middleware] INFO: Enabled item pipelines:
['catalogdemo.pipelines.CleanNamePipeline']
2026-04-22 05:51:14 [scrapy.core.scraper] WARNING: Dropped: Missing name
2026-04-22 05:51:14 [scrapy.extensions.feedexport] INFO: Stored jl feed (2 items) in: products.jl
The JSON Lines export makes it easy to see which items survived the pipeline because each accepted item is written as one line.
- Open the exported file to confirm the cleaned items reached the feed with the intended whitespace trim.
$ cat products.jl
{"name": "Starter Plan"}
{"name": "Team Plan"}
If an expected item is missing, check the crawl log for the DropItem reason raised by the pipeline.
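For context, output like the example above could come from a spider along these lines; the spider's start URL and hard-coded values are purely illustrative:

import scrapy


class CatalogSpider(scrapy.Spider):
    name = "catalog"
    start_urls = ["https://example.com/"]  # placeholder URL for illustration

    def parse(self, response):
        # Illustrative values: two usable names plus one blank entry that
        # CleanNamePipeline drops with the "Missing name" reason.
        for raw_name in [" Starter Plan ", "Team Plan", "   "]:
            yield {"name": raw_name}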
