Catalog, archive, and documentation sites often repeat the same link shapes across listing pages and detail pages, so hand-written follow-up code quickly becomes repetitive and easy to widen too far. CrawlSpider keeps the crawl flow readable by moving link discovery into ordered rules instead of custom follow logic in every callback.
In current Scrapy releases, a CrawlSpider still combines Rule objects with LinkExtractor patterns so one rule can follow pagination while another sends matched detail pages to a callback. Rules are checked in order, the first matching rule claims each extracted link, and parse_start_url() remains the place to handle item extraction from the initial start_urls response when the seed page also contains data to keep.
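A minimal sketch of that ordering, using placeholder URL patterns rather than the project built below: the more specific detail rule is listed first so it claims matching links, and everything else falls through to the broader follow-only rule.

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class OrderedRulesSpider(CrawlSpider):
    name = "ordered_rules"
    allowed_domains = ["example.org"]
    start_urls = ["https://example.org/"]

    # Rules are evaluated top to bottom; the first rule that matches a link wins.
    rules = (
        # Specific pattern first: detail pages reach the callback.
        Rule(LinkExtractor(allow=(r"/items/\d+/$",)), callback="parse_item"),
        # Broad pattern second: every other on-site link is only followed.
        Rule(LinkExtractor(), follow=True),
    )

    def parse_item(self, response):
        yield {"url": response.url}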
Broad patterns can still pull in account pages, filtered search URLs, or other low-value branches, and OffsiteMiddleware silently drops followed requests whose host is not covered by allowed_domains. Keep the extraction scope tight with options such as restrict_css or deny, avoid overriding parse() in a CrawlSpider because the class uses that method internally to drive rule processing, and set an explicit callback on any new Request objects yielded from methods such as parse_item().
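For instance, a rule scoped like this hypothetical one (the deny patterns and CSS selector are illustrative, not taken from the project below) keeps account pages and filtered search URLs out of the crawl:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

detail_rule = Rule(
    LinkExtractor(
        allow=(r"/products/[^/]+/$",),
        # Exclude low-value branches even when they match the allow pattern.
        deny=(r"/account/", r"\?filter="),
        # Only consider links inside the main listing area of the page.
        restrict_css=("main.catalog",),
    ),
    callback="parse_item",
)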
Steps to use CrawlSpider in Scrapy:
- Change to the root of the Scrapy project that will hold the spider.
$ cd catalogdemo
- Generate a CrawlSpider skeleton in the project's spiders module.
$ scrapy genspider -t crawl catalog_crawl app.internal.example
Created spider 'catalog_crawl' using template 'crawl' in module:
  catalogdemo.spiders.catalog_crawl
The crawl template creates a CrawlSpider with placeholder rules so the spider only needs the real host patterns, extraction scope, and item callback.
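The exact skeleton varies by Scrapy version, but it comes out roughly like this before editing, with a placeholder rule and a stubbed item callback:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CatalogCrawlSpider(CrawlSpider):
    name = "catalog_crawl"
    allowed_domains = ["app.internal.example"]
    start_urls = ["https://app.internal.example"]

    # Placeholder rule from the template; it does not match real site URLs yet.
    rules = (Rule(LinkExtractor(allow=r"Items/"), callback="parse_item", follow=True),)

    def parse_item(self, response):
        item = {}
        # item["name"] = response.xpath('//div[@id="name"]').get()
        return item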
- Replace the generated spider with rules that match the real host, pagination URLs, and detail pages.
- catalogdemo/spiders/catalog_crawl.py
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CatalogCrawlSpider(CrawlSpider):
    name = "catalog_crawl"
    allowed_domains = ["app.internal.example"]
    start_urls = ["https://app.internal.example/products/"]

    rules = (
        Rule(
            LinkExtractor(
                allow=(r"/products/page/\d+/$",),
                restrict_css=("main.catalog",),
            ),
            follow=True,
        ),
        Rule(
            LinkExtractor(
                allow=(r"/products/[^/]+/$",),
                restrict_css=("main.catalog",),
            ),
            callback="parse_item",
        ),
    )

    def parse_item(self, response):
        yield {
            "name": response.css("h1::text").get(),
            "price": response.css("p.price::text").get(),
            "url": response.url,
        }
Use parse_start_url() when the seed URL also needs extraction, because the initial start_urls response is not sent to a rule callback.
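If the seed listing also carries data worth keeping, an override along these lines could be added to the spider above; the heading selector here is an assumption, not taken from a real page:

class CatalogCrawlSpider(CrawlSpider):
    # ...name, allowed_domains, start_urls, and rules as defined above...

    def parse_start_url(self, response):
        # The initial response for each start URL lands here instead of in
        # a rule callback, so extract anything worth keeping from the seed.
        yield {
            "listing_title": response.css("main.catalog h1::text").get(),  # assumed selector
            "url": response.url,
        }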
Do not override parse() in a CrawlSpider, and add an explicit callback to any new Request objects yielded from parse_item() or other methods so the response does not fall back to the default callback and get dispatched through the rules unexpectedly.
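As a sketch of that rule inside CatalogCrawlSpider, a follow-up request for a hypothetical related-products link (the aside.related selector is an assumption) names its callback explicitly:

    def parse_item(self, response):
        yield {
            "name": response.css("h1::text").get(),
            "price": response.css("p.price::text").get(),
            "url": response.url,
        }
        # Explicit callback: without it, the new response would fall back to
        # the CrawlSpider default callback and be run through the rules.
        for href in response.css("aside.related a::attr(href)").getall():  # assumed selector
            yield response.follow(href, callback=self.parse_item)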
- Run the spider with feed export enabled to confirm the rules follow both listing pages and detail pages.
$ scrapy crawl catalog_crawl -O catalog.json
2026-04-22 05:55:41 [scrapy.utils.log] INFO: Scrapy 2.15.0 started (bot: catalogdemo)
2026-04-22 05:55:50 [scrapy.core.engine] INFO: Spider opened
##### snipped #####
2026-04-22 05:55:57 [scrapy.extensions.feedexport] INFO: Stored json feed (3 items) in: catalog.json
2026-04-22 05:55:57 [scrapy.core.engine] INFO: Spider closed (finished)
-O overwrites the export file on each run, while -o appends to an existing feed.
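For example, assuming the same project, the two variants look like this; appending to a plain .json feed can leave invalid JSON, so JSON Lines is the safer target for -o:

$ scrapy crawl catalog_crawl -O catalog.json     # replace catalog.json on every run
$ scrapy crawl catalog_crawl -o catalog.jsonl    # append items as JSON Lines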
- Read the exported items to confirm the callback is scraping only the intended detail pages.
$ cat catalog.json
[
{"name": "Team Plan", "price": "$79", "url": "https://app.internal.example/products/team-plan/"},
{"name": "Starter Plan", "price": "$29", "url": "https://app.internal.example/products/starter-plan/"},
{"name": "Enterprise Plan", "price": "$199", "url": "https://app.internal.example/products/enterprise-plan/"}
]
If the export includes listing URLs, account pages, or filtered search results, narrow the allow patterns, add deny rules, or tighten restrict_css so LinkExtractor only sees the part of the page that contains crawl targets.
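One way to check the scope without re-running the whole crawl is to exercise the extractor in scrapy shell against a listing page; the URL and patterns below are the ones used by the spider above:

$ scrapy shell https://app.internal.example/products/
>>> from scrapy.linkextractors import LinkExtractor
>>> le = LinkExtractor(allow=(r"/products/[^/]+/$",), restrict_css=("main.catalog",))
>>> [link.url for link in le.extract_links(response)]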
