Crawling sites with repeating navigation patterns is easier when link-following rules are declarative instead of scattered across callbacks. CrawlSpider fits catalog-style sites where listing pages, pagination, and item pages share predictable URL structures, keeping scrapes focused on extraction rather than request bookkeeping.
In Scrapy, CrawlSpider applies a sequence of Rule objects to each response. A Rule uses LinkExtractor patterns to select URLs to follow, optionally assigns a callback for matched pages, and controls continued crawling through the follow flag while still honoring offsite filtering from allowed_domains.
Rules that are too broad can explode crawl scope, revisit the same content through multiple URL variants, or enter loops through search, account, and tracking pages. Keep allowed_domains strict, start from a narrow start_urls entry point, avoid overriding parse in a CrawlSpider (CrawlSpider uses parse internally to drive its rule processing), and use parse_start_url when extraction is needed from the start pages, as sketched below.
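As a minimal sketch of that last point, assuming a hypothetical featured-products block on the listing page, parse_start_url lets the spider extract from the start pages while the rules keep driving link following:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class FeaturedCrawlSpider(CrawlSpider):
    name = "featured_crawl"
    allowed_domains = ["app.internal.example"]
    start_urls = ["http://app.internal.example:8000/products/"]
    rules = (
        Rule(LinkExtractor(allow=(r"/products/[^/]+\.html$",)), callback="parse_item"),
    )

    def parse_start_url(self, response):
        # Called for the start_urls responses; extract here instead of
        # overriding parse(), which CrawlSpider needs for rule processing.
        # The selector is hypothetical and depends on the page markup.
        yield {"featured": response.css("div.featured h2::text").getall()}

    def parse_item(self, response):
        yield {"name": response.css("h1::text").get(), "url": response.url}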
Steps to use CrawlSpider in Scrapy:
- Generate a CrawlSpider template for the target domain.
$ scrapy genspider -t crawl catalog_crawl app.internal.example
Created spider 'catalog_crawl' using template 'crawl' in module:
  simplifiedguide.spiders.catalog_crawl
The module path reflects the Scrapy project name.
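Before editing, the generated spider is only a scaffold; it looks roughly like the following, though the exact placeholder rule and comments depend on the Scrapy version:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CatalogCrawlSpider(CrawlSpider):
    name = "catalog_crawl"
    allowed_domains = ["app.internal.example"]
    start_urls = ["https://app.internal.example"]

    # Placeholder rule to be replaced with patterns for the target site.
    rules = (Rule(LinkExtractor(allow=r"Items/"), callback="parse_item", follow=True),)

    def parse_item(self, response):
        item = {}
        return item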
- Define CrawlSpider rules in the spider file.
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CatalogCrawlSpider(CrawlSpider):
    name = "catalog_crawl"
    allowed_domains = ["app.internal.example"]
    start_urls = ["http://app.internal.example:8000/products/"]

    rules = (
        Rule(
            LinkExtractor(
                allow=(r"/products(\?page=\d+)?$",),
            ),
            follow=True,
        ),
        Rule(
            LinkExtractor(
                allow=(r"/products/[^/]+\.html$",),
            ),
            callback="parse_item",
            follow=False,
        ),
    )

    def parse_item(self, response):
        yield {
            "name": response.css("h1::text").get(),
            "price": response.css("p.price::text").get(),
            "url": response.url,
        }
Use restrict_css or restrict_xpaths in LinkExtractor to ignore header and footer links when URL patterns still match too much.
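For example, assuming the item links sit inside a hypothetical div.product-grid container, the item rule from the spider above could be narrowed like this:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

rules = (
    Rule(
        LinkExtractor(allow=(r"/products(\?page=\d+)?$",)),
        follow=True,
    ),
    Rule(
        LinkExtractor(
            allow=(r"/products/[^/]+\.html$",),
            restrict_css=("div.product-grid",),  # hypothetical listing container
        ),
        callback="parse_item",
        follow=False,
    ),
)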
- Run the CrawlSpider with JSON export enabled.
$ scrapy crawl catalog_crawl -O catalog.json -s HTTPCACHE_ENABLED=False -s LOG_LEVEL=INFO -s DEPTH_LIMIT=0
2026-01-01 09:23:31 [scrapy.extensions.feedexport] INFO: Stored json feed (6 items) in: catalog.json
Use -O to overwrite an existing file, or -o to append.
Broad rules can trigger long crawls and heavy load on the target site.
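If the rule set still turns out broader than intended, crawl limits can also live in the spider itself instead of on the command line; this is a sketch with hypothetical limit values:

from scrapy.spiders import CrawlSpider


class CatalogCrawlSpider(CrawlSpider):
    # name, allowed_domains, start_urls, and rules as defined above.
    custom_settings = {
        "CLOSESPIDER_PAGECOUNT": 200,  # close the spider after ~200 responses
        "DEPTH_LIMIT": 3,              # stop following links beyond depth 3
        "DOWNLOAD_DELAY": 0.5,         # seconds between requests to reduce load
    }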
- Inspect the JSON output to confirm rule coverage.
$ head -n 5 catalog.json
[
{"name": "Starter Plan", "price": "$29", "url": "http://app.internal.example:8000/products/starter-plan.html"},
{"name": "Team Plan", "price": "$79", "url": "http://app.internal.example:8000/products/team-plan.html"},
{"name": "Enterprise Plan", "price": "$199", "url": "http://app.internal.example:8000/products/enterprise-plan.html"},
{"name": "Growth Plan", "price": "$129", "url": "http://app.internal.example:8000/products/growth-plan.html"},
##### snipped #####
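Beyond eyeballing the first lines, a short script can confirm that every crawled item page produced the expected fields; a minimal sketch assuming the field names used in parse_item:

import json

with open("catalog.json") as f:
    items = json.load(f)

# Every item-page match should have produced a name, a price, and a URL.
missing = [item for item in items if not (item.get("name") and item.get("price"))]
print(f"{len(items)} items exported, {len(missing)} with missing fields")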
