Item pipelines make scraped data usable by applying consistent cleaning, validation, and persistence after extraction. Enabling a pipeline is a practical way to trim whitespace, normalize types, drop incomplete records, or store items externally without turning spider code into a maintenance trap.

In Scrapy, each item yielded by a spider is passed through the pipeline chain defined in ITEM_PIPELINES. A pipeline is a Python class implementing process_item, and Scrapy executes enabled pipelines in priority order (lower numbers run earlier) before items are exported to feeds or processed by downstream components.

Ordering matters when one pipeline depends on changes made by another, and failure modes differ: an unhandled exception in process_item is logged as an error and the item is discarded while the crawl continues, whereas intentional filtering requires raising DropItem, which Scrapy records as a drop rather than an error. Keep pipeline logic focused and deterministic, and validate changes with short test runs before long crawls to avoid silently damaging output.
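
For instance, dropping incomplete records takes only a few lines. The sketch below is illustrative rather than part of the project: the RequiredFieldsPipeline name and its required-field tuple are assumptions to adapt.

    from itemadapter import ItemAdapter
    from scrapy.exceptions import DropItem


    class RequiredFieldsPipeline:
        # Assumed field names for this sketch; adjust to your items.
        required_fields = ("name", "price")

        def process_item(self, item, spider):
            adapter = ItemAdapter(item)
            missing = [f for f in self.required_fields if not adapter.get(f)]
            if missing:
                # DropItem is the supported filtering signal; Scrapy logs the drop.
                raise DropItem(f"missing fields: {missing}")
            return item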

Steps to enable item pipelines in Scrapy:

  1. Open the project's pipelines file.
    $ vi catalog_demo/pipelines.py

    Replace catalog_demo with the Scrapy project package name.

  2. Add a pipeline class to normalize fields.
    from itemadapter import ItemAdapter


    class CleanNamePipeline:
        def process_item(self, item, spider):
            # Uniform read/write access whether item is an Item or a dict.
            adapter = ItemAdapter(item)
            name = adapter.get("name")
            if name:
                # Coerce to str and strip surrounding whitespace.
                adapter["name"] = str(name).strip()
            return item

    ItemAdapter supports both Scrapy Item objects and plain dict items.
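
    To sanity-check that behavior outside a crawl, the pipeline can be called directly. A quick sketch, assuming the hypothetical Product Item below exists only for this check:

    import scrapy

    from catalog_demo.pipelines import CleanNamePipeline


    class Product(scrapy.Item):  # hypothetical minimal Item for this check
        name = scrapy.Field()


    pipeline = CleanNamePipeline()
    # CleanNamePipeline never touches the spider argument, so None suffices.
    print(pipeline.process_item({"name": "  Starter Plan  "}, None))
    print(pipeline.process_item(Product(name="  Team Plan  "), None))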

  3. Enable the pipeline in settings.py with a priority.
    ITEM_PIPELINES = {
        "catalog_demo.pipelines.CleanNamePipeline": 300,
    }

    Lower priority numbers run earlier; add more entries to chain multiple pipelines, as in the sketch after this step.

    An incorrect module path in ITEM_PIPELINES can prevent the spider from starting due to an ImportError.
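
    As an example of chaining, a second entry with a higher number runs after the first. The PriceToFloatPipeline referenced below is hypothetical and would still need to be written in pipelines.py:

    ITEM_PIPELINES = {
        "catalog_demo.pipelines.CleanNamePipeline": 300,
        # Hypothetical second stage; 400 > 300, so it runs after name cleaning.
        "catalog_demo.pipelines.PriceToFloatPipeline": 400,
    }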

  4. Run the spider with JSON export enabled.
    $ scrapy crawl catalog -O products.json
    2026-01-01 09:38:57 [scrapy.middleware] INFO: Enabled item pipelines:
    ['catalog_demo.pipelines.CleanNamePipeline']

    Add -s CLOSESPIDER_ITEMCOUNT=10 for quick pipeline testing on a small crawl.
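
    Because pipelines are plain classes, the same logic can also be verified without crawling at all. A minimal pytest sketch (the spider argument is stubbed as None, which CleanNamePipeline never uses):

    from catalog_demo.pipelines import CleanNamePipeline


    def test_clean_name_strips_whitespace():
        pipeline = CleanNamePipeline()
        item = pipeline.process_item({"name": "  Starter Plan \n"}, None)
        assert item["name"] == "Starter Plan"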

  5. Inspect the output to confirm cleaned fields.
    $ head -n 4 products.json
    [
    {"name": "Starter Plan", "price": "$29", "url": "http://app.internal.example:8000/products/starter-plan.html"},
    {"name": "Team Plan", "price": "$79", "url": "http://app.internal.example:8000/products/team-plan.html"},
    {"name": "Enterprise Plan", "price": "$199", "url": "http://app.internal.example:8000/products/enterprise-plan.html"},
    ##### snipped #####