How to use CrawlSpider in Scrapy

Catalog, archive, and documentation sites often repeat the same link shapes across listing pages and detail pages, so hand-written follow-up code quickly becomes repetitive and prone to crawling more than intended. CrawlSpider keeps the crawl flow readable by moving link discovery into ordered rules instead of bespoke follow logic in every callback.

In current Scrapy releases, a CrawlSpider still combines Rule objects with LinkExtractor patterns, so one rule can follow pagination while another sends matched detail pages to a callback. Rules are checked in order, and the first matching rule claims each extracted link. parse_start_url() remains the place to extract items from the initial start_urls response when the seed page itself contains data worth keeping.
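
A minimal sketch of how that ordering plays out (the spider and patterns here are illustrative, not the project built below): a broad detail pattern would also match pagination URLs such as /products/page/2/, so the narrower pagination rule must come first to claim those links.

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule


    class OrderingDemoSpider(CrawlSpider):
        name = "ordering_demo"
        start_urls = ["https://app.internal.example/products/"]

        rules = (
            # Checked first: pagination links match here and are followed,
            # so the broad detail rule below never claims them.
            Rule(LinkExtractor(allow=(r"/products/page/\d+/$",)), follow=True),
            # Everything else under /products/ is treated as a detail page.
            Rule(LinkExtractor(allow=(r"/products/",)), callback="parse_item"),
        )

        def parse_item(self, response):
            # Stub callback so the sketch stands alone.
            yield {"url": response.url}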

Broad patterns can still pull in account pages, filtered search URLs, or other low-value branches, and OffsiteMiddleware drops followed requests whose host falls outside allowed_domains. Keep the extraction scope tight with options such as restrict_css or deny, leave parse() untouched because CrawlSpider uses it internally for rule dispatch, and set an explicit callback on any new Request objects yielded from methods such as parse_item().

Steps to use CrawlSpider in Scrapy:

  1. Change to the root of the Scrapy project that will hold the spider.
    $ cd catalogdemo
  2. Generate a CrawlSpider skeleton in the project's spiders module.
    $ scrapy genspider -t crawl catalog_crawl app.internal.example
    Created spider 'catalog_crawl' using template 'crawl' in module:
      catalogdemo.spiders.catalog_crawl

    The crawl template creates a CrawlSpider with placeholder rules so the spider only needs the real host patterns, extraction scope, and item callback.
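
    Depending on the Scrapy version, the generated file looks roughly like this; the Items/ pattern and the commented-out selectors are template placeholders to replace:

    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule


    class CatalogCrawlSpider(CrawlSpider):
        name = "catalog_crawl"
        allowed_domains = ["app.internal.example"]
        start_urls = ["https://app.internal.example"]

        rules = (Rule(LinkExtractor(allow=r"Items/"), callback="parse_item", follow=True),)

        def parse_item(self, response):
            item = {}
            #item["domain_id"] = response.xpath('//input[@id="sid"]/@value').get()
            #item["name"] = response.xpath('//div[@id="name"]').get()
            #item["description"] = response.xpath('//div[@id="description"]').get()
            return item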

  3. Replace the generated spider with rules that match the real host, pagination URLs, and detail pages.
    catalogdemo/spiders/catalog_crawl.py
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
     
     
    class CatalogCrawlSpider(CrawlSpider):
        name = "catalog_crawl"
        allowed_domains = ["app.internal.example"]
        start_urls = ["https://app.internal.example/products/"]
     
        rules = (
            Rule(
                LinkExtractor(
                    allow=(r"/products/page/\d+/$",),
                    restrict_css=("main.catalog",),
                ),
                follow=True,
            ),
            Rule(
                LinkExtractor(
                    allow=(r"/products/[^/]+/$",),
                    restrict_css=("main.catalog",),
                ),
                callback="parse_item",
            ),
        )
     
        def parse_item(self, response):
            yield {
                "name": response.css("h1::text").get(),
                "price": response.css("p.price::text").get(),
                "url": response.url,
            }

    Use parse_start_url() when the seed URL also needs extraction, because the initial start_urls response is not sent to a rule callback.
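
    A minimal sketch of that override as an extra method on CatalogCrawlSpider, assuming the seed listing page renders featured product cards (the div.featured and h2 selectors are illustrative):

        def parse_start_url(self, response):
            # Hypothetical selectors: extract items rendered on the seed
            # listing page itself, which no rule callback ever receives.
            for card in response.css("main.catalog div.featured"):
                yield {
                    "name": card.css("h2::text").get(),
                    "price": card.css("p.price::text").get(),
                    "url": response.url,
                }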

    Keep parse() untouched in a CrawlSpider, and add an explicit callback to any new Request objects yielded from parse_item() or other methods; a request without a callback defaults to parse(), which in a CrawlSpider routes the response back through rule dispatch.
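
    For instance, following each detail page to a reviews subpage might look like this (the reviews/ path, parse_reviews() method, and blockquote selector are all illustrative, and the module would also need import scrapy at the top):

        def parse_item(self, response):
            yield {
                "name": response.css("h1::text").get(),
                "price": response.css("p.price::text").get(),
                "url": response.url,
            }
            # Explicit callback: a Request without one defaults to parse(),
            # which in a CrawlSpider re-enters rule dispatch.
            yield scrapy.Request(
                response.urljoin("reviews/"),
                callback=self.parse_reviews,
            )

        def parse_reviews(self, response):
            yield {"review": response.css("blockquote::text").get(), "url": response.url}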

  4. Run the spider with feed export enabled to confirm the rules follow both listing pages and detail pages.
    $ scrapy crawl catalog_crawl -O catalog.json
    2026-04-22 05:55:41 [scrapy.utils.log] INFO: Scrapy 2.15.0 started (bot: catalogdemo)
    2026-04-22 05:55:50 [scrapy.core.engine] INFO: Spider opened
    ##### snipped #####
    2026-04-22 05:55:57 [scrapy.extensions.feedexport] INFO: Stored json feed (3 items) in: catalog.json
    2026-04-22 05:55:57 [scrapy.core.engine] INFO: Spider closed (finished)

    -O overwrites the export file on each run, while -o appends to an existing feed.
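
    Appending with -o to a .json feed concatenates JSON arrays and leaves the file invalid as JSON, so a JSON Lines target is safer when collecting items across runs:
    $ scrapy crawl catalog_crawl -o catalog.jsonl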

  5. Read the exported items to confirm the callback is scraping only the intended detail pages.
    $ cat catalog.json
    [
    {"name": "Team Plan", "price": "$79", "url": "https://app.internal.example/products/team-plan/"},
    {"name": "Starter Plan", "price": "$29", "url": "https://app.internal.example/products/starter-plan/"},
    {"name": "Enterprise Plan", "price": "$199", "url": "https://app.internal.example/products/enterprise-plan/"}
    ]

    If the export includes listing URLs, account pages, or filtered search results, narrow the allow patterns, add deny rules, or tighten restrict_css so LinkExtractor only sees the part of the page that contains crawl targets.
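
    As a sketch of that narrowing, assuming the low-value branches live under /account/ and /products/search/ (both paths illustrative), the detail rule's extractor could become:

    LinkExtractor(
        allow=(r"/products/[^/]+/$",),
        # Hypothetical low-value branches excluded explicitly.
        deny=(r"/account/", r"/products/search/"),
        restrict_css=("main.catalog",),
    )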