Catalog, archive, and documentation sites often repeat the same link structure across listing pages and detail pages, so hand-written response.follow() logic quickly becomes noisy and easy to broaden by mistake. CrawlSpider keeps that crawl logic readable by moving link-following decisions into a small set of reusable rules.

In Scrapy, a CrawlSpider applies Rule objects in order, using LinkExtractor patterns to decide which links should be followed and which responses should be sent to an item callback. The start URLs still seed the crawl normally, and parse_start_url() is available when the first response also needs extraction logic.

Broad rules can still wander into account pages, filtered search URLs, or calendar loops, and when more than one rule matches the same link, only the first matching rule is applied. Keep allowed_domains tight, limit link extraction to the part of the page that actually contains crawl targets, leave parse() alone, and set explicit callbacks on any new Request objects yielded from CrawlSpider methods.
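
One way to keep a rule tight is to preview what its LinkExtractor would return before wiring it into the spider. A minimal sketch, assuming a listing page is already open in scrapy shell; the pattern and CSS restriction are placeholders, not part of any particular project:

    from scrapy.linkextractors import LinkExtractor

    # Preview which links a prospective rule would extract from the current
    # response; each Link object carries the absolute URL and anchor text.
    extractor = LinkExtractor(
        allow=(r"/products/page/\d+/$",),
        restrict_css=("main.catalog",),
    )
    for link in extractor.extract_links(response):
        print(link.url, link.text)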

Steps to use CrawlSpider in Scrapy:

  1. Change to the root of the Scrapy project that will hold the spider.
    $ cd catalogdemo
  2. Generate a CrawlSpider skeleton inside the project's spiders module.
    $ scrapy genspider -t crawl catalog_crawl app.internal.example
    Created spider 'catalog_crawl' using template 'crawl' in module:
      catalogdemo.spiders.catalog_crawl

    The crawl template creates a CrawlSpider with placeholder rules so only the target patterns need to be filled in.
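
    The skeleton looks roughly like this; the exact placeholder rule and start URL vary between Scrapy versions:

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule


    class CatalogCrawlSpider(CrawlSpider):
        name = "catalog_crawl"
        allowed_domains = ["app.internal.example"]
        start_urls = ["https://app.internal.example"]

        # Placeholder rule generated by the template; replaced in the next step.
        rules = (Rule(LinkExtractor(allow=r"Items/"), callback="parse_item", follow=True),)

        def parse_item(self, response):
            item = {}
            return item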

  3. Replace the generated spider with rules that follow listing pages and parse detail pages.
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
     
     
    class CatalogCrawlSpider(CrawlSpider):
        name = "catalog_crawl"
        allowed_domains = ["app.internal.example"]
        start_urls = ["https://app.internal.example/products/"]
     
        rules = (
            Rule(
                LinkExtractor(
                    allow=(r"/products/page/\d+/$",),
                    restrict_css=("main.catalog",),
                ),
                follow=True,
            ),
            Rule(
                LinkExtractor(
                    allow=(r"/products/[^/]+/$",),
                    restrict_css=("main.catalog",),
                ),
                callback="parse_item",
            ),
        )
     
        def parse_item(self, response):
            yield {
                "name": response.css("h1::text").get(),
                "price": response.css("p.price::text").get(),
                "url": response.url,
            }

    Use parse_start_url() when the first URL also needs extraction, because start-page responses are not sent to a rule callback automatically.
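
    A minimal sketch of that override, assuming the start page itself lists a few featured products worth capturing; the featured-item selectors are assumptions about the page markup:

    def parse_start_url(self, response):
        # Runs only for responses to the start URLs, which the rules never
        # route to parse_item(); the selectors below are illustrative only.
        for card in response.css("main.catalog li.featured"):
            yield {
                "name": card.css("h2::text").get(),
                "url": response.url,
            }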

    Keep parse() untouched in a CrawlSpider, and set an explicit callback on any Request objects yielded from methods such as parse_item() to avoid unexpected dispatching.
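
    For example, a detail page might link to a separate reviews page. The reviews link and parse_reviews() below are hypothetical additions used only to show the explicit callback; without callback=, the response would fall back into CrawlSpider's internal rule dispatching:

    import scrapy  # needed for the explicit Request below

    # Both methods live on CatalogCrawlSpider from step 3.
    def parse_item(self, response):
        yield {
            "name": response.css("h1::text").get(),
            "price": response.css("p.price::text").get(),
            "url": response.url,
        }
        # Hypothetical follow-up request; the explicit callback keeps this
        # response from being routed through the rules again.
        href = response.css("a.reviews::attr(href)").get()
        if href:
            yield scrapy.Request(response.urljoin(href), callback=self.parse_reviews)

    def parse_reviews(self, response):
        yield {"review": response.css("blockquote::text").get(), "url": response.url}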

  4. Narrow the extraction region or add deny patterns before running the crawl if the site navigation contains unrelated links.
    Rule(
        LinkExtractor(
            allow=(r"/products/[^/]+/$",),
            deny=(r"/account/", r"/search/", r"/cart/"),
            restrict_css=("main.catalog",),
        ),
        callback="parse_item",
    )

    Rules are checked in order, so place the most specific matches first if two patterns can match the same URL.
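
    As an illustration, a hypothetical rule for discounted items has to sit above the general detail rule; otherwise the general rule claims the link first and parse_sale_item() never runs:

    rules = (
        Rule(
            LinkExtractor(allow=(r"/products/sale-[^/]+/$",)),
            callback="parse_sale_item",  # hypothetical, more specific rule first
        ),
        Rule(
            LinkExtractor(allow=(r"/products/[^/]+/$",)),
            callback="parse_item",       # general detail rule second
        ),
    )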

  5. Run the spider with feed export enabled to confirm the rules follow pagination and item pages.
    $ scrapy crawl catalog_crawl -O catalog.json
    2026-04-16 13:33:51 [scrapy.utils.log] INFO: Scrapy 2.15.0 started (bot: catalogdemo)
    2026-04-16 13:33:52 [scrapy.core.engine] INFO: Spider opened
    ##### snipped #####
    2026-04-16 13:33:57 [scrapy.extensions.feedexport] INFO: Stored json feed (3 items) in: catalog.json
    2026-04-16 13:33:57 [scrapy.core.engine] INFO: Spider closed (finished)

    -O overwrites the export file on each run, while -o appends to an existing feed.
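
    If the feed should grow across runs instead, appending to a JSON array produces an invalid file, so switch to a line-oriented feed before using -o; the .jl extension selects JSON Lines:

    $ scrapy crawl catalog_crawl -o catalog.jl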

  6. Review the exported items to confirm the detail-page callback is returning the expected fields.
    $ cat catalog.json
    [
    {"name": "Team Plan", "price": "$79", "url": "https://app.internal.example/products/team-plan/"},
    {"name": "Starter Plan", "price": "$29", "url": "https://app.internal.example/products/starter-plan/"},
    {"name": "Enterprise Plan", "price": "$199", "url": "https://app.internal.example/products/enterprise-plan/"}
    ]

    If the export shows only listing URLs or misses deeper pages, tighten the LinkExtractor patterns or pair the spider with How to set a crawl depth limit in Scrapy while testing.
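
    While experimenting with the patterns, a depth limit can also be applied to a single run from the command line; DEPTH_LIMIT=2 restricts the crawl to pages reachable within two link hops of the start URLs:

    $ scrapy crawl catalog_crawl -O catalog.json -s DEPTH_LIMIT=2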