Item pipelines let a Scrapy project clean, validate, or store each scraped item after the spider yields it. They keep repeated data handling out of spider callbacks and make the item flow easier to reuse across multiple spiders.

Scrapy sends items through the classes listed in ITEM_PIPELINES, from lower priority numbers to higher ones. Each enabled pipeline implements process_item(), can update the item through ItemAdapter, and can stop bad records by raising DropItem instead of passing them to later pipelines or feed exports.

The settings.py generated by scrapy startproject leaves the ITEM_PIPELINES example commented out, so enabling a new pipeline usually means uncommenting that setting (or adding it) and pointing it at the correct class path. A wrong import path or an exception inside the pipeline can stop the crawl or silently drop records, so a short exported test run is the quickest way to confirm the pipeline is active and doing the intended cleanup.

Steps to enable item pipelines in Scrapy:

  1. Open the project pipeline module.
    $ vi catalog_demo/pipelines.py

    Replace catalog_demo with the Python package name created by scrapy startproject.

  2. Add a pipeline class that strips whitespace and drops empty names before export.
    from itemadapter import ItemAdapter
    from scrapy.exceptions import DropItem
     
    class CleanNamePipeline:
        def process_item(self, item, spider):
            adapter = ItemAdapter(item)
            cleaned_name = str(adapter.get("name", "")).strip()
            if not cleaned_name:
                raise DropItem("Missing name")
            adapter["name"] = cleaned_name
            return item

    ItemAdapter keeps the same pipeline working for Scrapy Item objects and plain dict items.
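
    As a quick sketch of that claim (the Product item class here is hypothetical, not part of the project), the same adapter calls work on a declared Item and on a plain dict:

    import scrapy
    from itemadapter import ItemAdapter

    class Product(scrapy.Item):
        name = scrapy.Field()

    for item in (Product(name="  Starter Plan  "), {"name": "  Starter Plan  "}):
        adapter = ItemAdapter(item)
        # Same get/set interface regardless of the underlying item type.
        adapter["name"] = str(adapter.get("name", "")).strip()
        print(type(item).__name__, adapter.get("name"))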

  3. Add or uncomment the ITEM_PIPELINES setting in settings.py.
    ITEM_PIPELINES = {
        "catalog_demo.pipelines.CleanNamePipeline": 300,
    }

    Lower numbers run earlier, so use the smallest number for the first cleanup or validation stage in a longer chain, as sketched at the end of this step.

    If the import path does not match the real module and class name, Scrapy cannot enable the pipeline and the crawl exits before items are processed.
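
    To extend the chain, each extra stage gets its own entry with a higher number. A sketch with a hypothetical RequirePricePipeline, reusing the ItemAdapter and DropItem imports from step 2:

    # In catalog_demo/pipelines.py, below CleanNamePipeline.
    class RequirePricePipeline:
        def process_item(self, item, spider):
            if ItemAdapter(item).get("price") in (None, ""):
                raise DropItem("Missing price")
            return item

    # In settings.py: 300 runs before 400, so names are cleaned first.
    ITEM_PIPELINES = {
        "catalog_demo.pipelines.CleanNamePipeline": 300,
        "catalog_demo.pipelines.RequirePricePipeline": 400,
    }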

  4. Run the spider with a short feed export to confirm the pipeline is enabled.
    $ scrapy crawl catalog -O products.jl -s CLOSESPIDER_ITEMCOUNT=2
    2026-04-16 05:32:19 [scrapy.middleware] INFO: Enabled item pipelines:
    ['catalog_demo.pipelines.CleanNamePipeline']
    2026-04-16 05:32:19 [scrapy.extensions.feedexport] INFO: Stored jl feed (2 items) in: products.jl

    products.jl uses JSON Lines, so each scraped item is written as one line and can be inspected with a plain cat command.
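
    Because the file is JSON Lines, each line also parses on its own with the standard library, which is handy once the feed grows past a quick cat. A minimal sketch:

    import json

    with open("products.jl", encoding="utf-8") as feed:
        for line in feed:
            item = json.loads(line)  # one complete JSON object per line
            print(item["name"])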

  5. Open the exported feed to confirm the cleaned values were written.
    $ cat products.jl
    {"name": "Starter Plan"}
    {"name": "Team Plan"}

    If a record disappears from the exported file, check the crawl log for a DropItem message from the pipeline.
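
    To confirm that a drop comes from the pipeline rather than the spider, the DropItem path can also be reproduced directly (the blank record here is made up for the test):

    from scrapy.exceptions import DropItem

    from catalog_demo.pipelines import CleanNamePipeline

    try:
        CleanNamePipeline().process_item({"name": "   "}, spider=None)
    except DropItem as exc:
        print(f"Dropped as expected: {exc}")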