Catalog, archive, and documentation sites often repeat the same link shapes across listing pages and detail pages, so hand-written follow-up code quickly becomes repetitive and easy to widen too far. CrawlSpider keeps the crawl flow readable by moving link discovery into ordered rules instead of custom follow logic in every callback.
In current Scrapy releases, a CrawlSpider still combines Rule objects with LinkExtractor patterns so one rule can follow pagination while another sends matched detail pages to a callback. Rules are checked in order, the first matching rule claims each extracted link, and parse_start_url() remains the place to handle item extraction from the initial start_urls response when the seed page also contains data to keep.
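A minimal sketch of that ordering, using placeholder URL patterns rather than the project built below: the more specific detail rule is listed first so it claims matching links, and everything else falls through to the broader follow-only rule.

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class OrderedRulesSpider(CrawlSpider):
    name = "ordered_rules"
    allowed_domains = ["example.org"]
    start_urls = ["https://example.org/"]

    # Rules are evaluated top to bottom; the first rule that matches a link wins.
    rules = (
        # Specific pattern first: detail pages reach the callback.
        Rule(LinkExtractor(allow=(r"/items/\d+/$",)), callback="parse_item"),
        # Broad pattern second: every other on-site link is only followed.
        Rule(LinkExtractor(), follow=True),
    )

    def parse_item(self, response):
        yield {"url": response.url}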
Broad patterns can still pull in account pages, filtered search URLs, or other low-value branches, and OffsiteMiddleware silently drops followed requests whose host is not covered by allowed_domains. Keep the extraction scope tight with options such as restrict_css or deny, avoid overriding parse() in a CrawlSpider because the class uses that method internally to drive rule processing, and set an explicit callback on any new Request objects yielded from methods such as parse_item().
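For instance, a rule scoped like this hypothetical one (the deny patterns and CSS selector are illustrative, not taken from the project below) keeps account pages and filtered search URLs out of the crawl:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

detail_rule = Rule(
    LinkExtractor(
        allow=(r"/products/[^/]+/$",),
        # Exclude low-value branches even when they match the allow pattern.
        deny=(r"/account/", r"\?filter="),
        # Only consider links inside the main listing area of the page.
        restrict_css=("main.catalog",),
    ),
    callback="parse_item",
)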
Steps to use CrawlSpider in Scrapy:
- Change to the root of the Scrapy project that will hold the spider.
$ cd catalogdemo
- Generate a CrawlSpider skeleton in the project's spiders module.
$ scrapy genspider -t crawl catalog_crawl app.internal.example
Created spider 'catalog_crawl' using template 'crawl' in module:
  catalogdemo.spiders.catalog_crawl
The crawl template creates a CrawlSpider with placeholder rules so the spider only needs the real host patterns, extraction scope, and item callback.
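The exact skeleton varies by Scrapy version, but it comes out roughly like this before editing, with a placeholder rule and a stubbed item callback:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CatalogCrawlSpider(CrawlSpider):
    name = "catalog_crawl"
    allowed_domains = ["app.internal.example"]
    start_urls = ["https://app.internal.example"]

    # Placeholder rule from the template; it does not match real site URLs yet.
    rules = (Rule(LinkExtractor(allow=r"Items/"), callback="parse_item", follow=True),)

    def parse_item(self, response):
        item = {}
        # item["name"] = response.xpath('//div[@id="name"]').get()
        return item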
- Replace the generated spider with rules that match the real host, pagination URLs, and detail pages.
- catalogdemo/spiders/catalog_crawl.py
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CatalogCrawlSpider(CrawlSpider):
    name = "catalog_crawl"
    allowed_domains = ["app.internal.example"]
    start_urls = ["https://app.internal.example/products/"]

    rules = (
        Rule(
            LinkExtractor(
                allow=(r"/products/page/\d+/$",),
                restrict_css=("main.catalog",),
            ),
            follow=True,
        ),
        Rule(
            LinkExtractor(
                allow=(r"/products/[^/]+/$",),
                restrict_css=("main.catalog",),
            ),
            callback="parse_item",
        ),
    )

    def parse_item(self, response):
        yield {
            "name": response.css("h1::text").get(),
            "price": response.css("p.price::text").get(),
            "url": response.url,
        }
Use parse_start_url() when the seed URL also needs extraction, because the initial start_urls response is not sent to a rule callback.
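If the seed listing also carries data worth keeping, an override along these lines could be added to the spider above; the heading selector here is an assumption, not taken from a real page:

class CatalogCrawlSpider(CrawlSpider):
    # ...name, allowed_domains, start_urls, and rules as defined above...

    def parse_start_url(self, response):
        # The initial response for each start URL lands here instead of in
        # a rule callback, so extract anything worth keeping from the seed.
        yield {
            "listing_title": response.css("main.catalog h1::text").get(),  # assumed selector
            "url": response.url,
        }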
Do not override parse() in a CrawlSpider, and add an explicit callback to any new Request objects yielded from parse_item() or other methods so the response does not fall back to the default callback and get dispatched through the rules unexpectedly.
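As a sketch of that rule inside CatalogCrawlSpider, a follow-up request for a hypothetical related-products link (the aside.related selector is an assumption) names its callback explicitly:

    def parse_item(self, response):
        yield {
            "name": response.css("h1::text").get(),
            "price": response.css("p.price::text").get(),
            "url": response.url,
        }
        # Explicit callback: without it, the new response would fall back to
        # the CrawlSpider default callback and be run through the rules.
        for href in response.css("aside.related a::attr(href)").getall():  # assumed selector
            yield response.follow(href, callback=self.parse_item)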
- Run the spider with feed export enabled to confirm the rules follow both listing pages and detail pages.
$ scrapy crawl catalog_crawl -O catalog.json
2026-04-22 05:55:41 [scrapy.utils.log] INFO: Scrapy 2.15.0 started (bot: catalogdemo)
2026-04-22 05:55:50 [scrapy.core.engine] INFO: Spider opened
##### snipped #####
2026-04-22 05:55:57 [scrapy.extensions.feedexport] INFO: Stored json feed (3 items) in: catalog.json
2026-04-22 05:55:57 [scrapy.core.engine] INFO: Spider closed (finished)
-O overwrites the export file on each run, while -o appends to an existing feed.
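For example, assuming the same project, the two variants look like this; appending to a plain .json feed can leave invalid JSON, so JSON Lines is the safer target for -o:

$ scrapy crawl catalog_crawl -O catalog.json     # replace catalog.json on every run
$ scrapy crawl catalog_crawl -o catalog.jsonl    # append items as JSON Lines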
- Read the exported items to confirm the callback is scraping only the intended detail pages.
$ cat catalog.json
[
{"name": "Team Plan", "price": "$79", "url": "https://app.internal.example/products/team-plan/"},
{"name": "Starter Plan", "price": "$29", "url": "https://app.internal.example/products/starter-plan/"},
{"name": "Enterprise Plan", "price": "$199", "url": "https://app.internal.example/products/enterprise-plan/"}
]
If the export includes listing URLs, account pages, or filtered search results, narrow the allow patterns, add deny rules, or tighten restrict_css so LinkExtractor only sees the part of the page that contains crawl targets.
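One way to check the scope without re-running the whole crawl is to exercise the extractor in scrapy shell against a listing page; the URL and patterns below are the ones used by the spider above:

$ scrapy shell https://app.internal.example/products/
>>> from scrapy.linkextractors import LinkExtractor
>>> le = LinkExtractor(allow=(r"/products/[^/]+/$",), restrict_css=("main.catalog",))
>>> [link.url for link in le.extract_links(response)]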
