Crawling sites with repeating navigation patterns is easier when link-following rules are declarative instead of scattered across callbacks. CrawlSpider fits catalog-style sites where listing pages, pagination, and item pages share predictable URL structures, keeping scrapes focused on extraction rather than request bookkeeping.

In Scrapy, CrawlSpider applies a sequence of Rule objects to each response. A Rule uses LinkExtractor patterns to select URLs to follow, optionally assigns a callback for matched pages, and controls whether links keep being extracted from matched pages through the follow flag (which defaults to True when no callback is set and False otherwise), all while still honoring offsite filtering from allowed_domains.
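
A single rule therefore reads as "extract links matching these patterns, optionally hand the matched pages to a callback, and decide whether to keep crawling from them". A minimal sketch of one rule, assuming a hypothetical /docs/ section and a parse_doc callback:

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import Rule

    # Follow every link whose URL contains /docs/, send each matched page
    # to the spider's parse_doc method, and keep extracting links from
    # those pages as well.
    docs_rule = Rule(
        LinkExtractor(allow=(r"/docs/",)),
        callback="parse_doc",
        follow=True,
    )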

Rules that are too broad can explode crawl scope, revisit the same content through multiple URL variants, or enter loops through search, account, and tracking pages. Keep allowed_domains strict, start from a narrow start_urls entry point, avoid overriding parse in a CrawlSpider, and use parse_start_url when extraction is needed from the start pages.
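
A minimal sketch of that last point, assuming a hypothetical spider whose start page lists featured products under h2.featured headings (the spider name and the selector are assumptions, not part of the example project):

    from scrapy.spiders import CrawlSpider


    class FeaturedSpider(CrawlSpider):
        name = "featured"
        allowed_domains = ["app.internal.example"]
        start_urls = ["http://app.internal.example:8000/products/"]
        rules = ()  # link-following rules omitted for brevity

        # CrawlSpider routes responses for start_urls through parse_start_url,
        # so the entry page can be scraped without overriding parse.
        def parse_start_url(self, response):
            for title in response.css("h2.featured::text").getall():
                yield {"featured": title}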

Steps to use CrawlSpider in Scrapy:

  1. Generate a CrawlSpider template for the target domain.
    $ scrapy genspider -t crawl catalog_crawl app.internal.example
    Created spider 'catalog_crawl' using template 'crawl' in module:
      simplifiedguide.spiders.catalog_crawl

    The module path reflects the Scrapy project name.
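
    The generated file is only a skeleton; it looks roughly like the following (exact contents vary between Scrapy versions) and is replaced in the next step:

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule


    class CatalogCrawlSpider(CrawlSpider):
        name = "catalog_crawl"
        allowed_domains = ["app.internal.example"]
        start_urls = ["https://app.internal.example"]

        rules = (Rule(LinkExtractor(allow=r"Items/"), callback="parse_item", follow=True),)

        def parse_item(self, response):
            item = {}
            return item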

  2. Define CrawlSpider rules in the spider file.
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
     
     
    class CatalogCrawlSpider(CrawlSpider):
        name = "catalog_crawl"
        allowed_domains = ["app.internal.example"]
        start_urls = ["http://app.internal.example:8000/products/"]
     
        rules = (
            # Follow listing and pagination pages without extracting items.
            Rule(
                LinkExtractor(
                    allow=(r"/products/?(\?page=\d+)?$",),
                ),
                follow=True,
            ),
            # Extract data from each product page, without following its links.
            Rule(
                LinkExtractor(
                    allow=(r"/products/[^/]+\.html$",),
                ),
                callback="parse_item",
                follow=False,
            ),
        )
     
        def parse_item(self, response):
            yield {
                "name": response.css("h1::text").get(),
                "price": response.css("p.price::text").get(),
                "url": response.url,
            }

    Use restrict_css or restrict_xpaths in LinkExtractor to ignore header and footer links when URL patterns still match too much.
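
    For example, the second rule can be narrowed to links inside the main listing area. In the sketch below, div.listing and nav.pagination are assumptions about the page markup; the rest of the rule is unchanged from the spider above.

            Rule(
                LinkExtractor(
                    allow=(r"/products/[^/]+\.html$",),
                    # Only consider links found inside the listing and pagination
                    # blocks, ignoring header and footer links that also match.
                    restrict_css=("div.listing", "nav.pagination"),
                ),
                callback="parse_item",
                follow=False,
            ),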

  3. Run the CrawlSpider with JSON export enabled.
    $ scrapy crawl catalog_crawl -O catalog.json -s HTTPCACHE_ENABLED=False -s LOG_LEVEL=INFO -s DEPTH_LIMIT=0
    2026-01-01 09:23:31 [scrapy.extensions.feedexport] INFO: Stored json feed (6 items) in: catalog.json

    Use -O to overwrite an existing file, or -o to append. Appending to a JSON feed that already has content produces invalid JSON, so prefer -O here, or switch to a line-oriented format such as JSON Lines when appending.

    Broad rules can trigger long crawls and heavy load on the target site.
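
    The cache, depth, and feed settings can also be fixed on the spider so every run uses them. A sketch of a custom_settings attribute added to the CatalogCrawlSpider class from step 2, with the FEEDS entry playing the role of -O:

        # Per-spider settings equivalent to the command-line options above.
        custom_settings = {
            "HTTPCACHE_ENABLED": False,
            "DEPTH_LIMIT": 0,  # 0 means no depth limit
            "FEEDS": {"catalog.json": {"format": "json", "overwrite": True}},
        }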

  4. Inspect the JSON output to confirm rule coverage.
    $ cat catalog.json
    [
    {"name": "Starter Plan", "price": "$29", "url": "http://app.internal.example:8000/products/starter-plan.html"},
    {"name": "Team Plan", "price": "$79", "url": "http://app.internal.example:8000/products/team-plan.html"},
    {"name": "Enterprise Plan", "price": "$199", "url": "http://app.internal.example:8000/products/enterprise-plan.html"},
    {"name": "Growth Plan", "price": "$129", "url": "http://app.internal.example:8000/products/growth-plan.html"},
    ##### snipped #####
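
    Because -O rewrites the whole file, the feed should parse as a single JSON document. A quick check with Python's json module, using the field names from parse_item above:

    import json

    # Load the exported feed and confirm every item carries the expected fields.
    with open("catalog.json") as f:
        items = json.load(f)

    assert all({"name", "price", "url"} <= item.keys() for item in items)
    print(f"{len(items)} items exported")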