An XML sitemap is often the cleanest crawl seed because it lists the canonical URLs a site already publishes. Starting a Scrapy crawl from that file avoids brittle navigation paths, reduces missed sections, and keeps the first request wave aligned with the site's own inventory.
Current Scrapy releases ship SitemapSpider, which uses sitemap_urls to fetch a sitemap URL or a robots.txt file, follows sitemap index files, and dispatches matching page URLs through ordered sitemap_rules. When the entry file is a sitemap index, the sitemap_follow patterns limit which child sitemaps are opened.
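Before the full project example further down, the sketch below shows the robots.txt entry point; the docs.example host, the /guides/ path, and the callback name are placeholders, not part of the demo project.

from scrapy.spiders import SitemapSpider


class RobotsSeedSpider(SitemapSpider):
    # Sketch only: docs.example and /guides/ are illustrative placeholders.
    name = "robots_seed"
    allowed_domains = ["docs.example"]
    # Any Sitemap: lines found in robots.txt are read and followed.
    sitemap_urls = ["https://docs.example/robots.txt"]
    sitemap_rules = [
        (r"/guides/", "parse_guide"),
    ]

    def parse_guide(self, response):
        yield {"title": response.css("h1::text").get()}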
Each sitemap <loc> entry must be a full URL on a host permitted by allowed_domains, and a broad sitemap index can schedule a large number of requests quickly. Compressed .xml.gz sitemap responses are supported, but very large sitemap bodies can hit the crawler's download size limits before any page callback runs.
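If a sitemap body approaches those limits, the ceilings can be raised per spider through custom_settings. The sketch below uses Scrapy's DOWNLOAD_WARNSIZE and DOWNLOAD_MAXSIZE settings; the specific byte values are illustrative assumptions, not recommendations.

from scrapy.spiders import SitemapSpider


class LargeSitemapSpider(SitemapSpider):
    name = "large_sitemap"
    allowed_domains = ["shop.example"]
    sitemap_urls = ["https://shop.example/sitemap.xml"]
    sitemap_rules = [(r"/products/", "parse_product")]

    # DOWNLOAD_WARNSIZE only logs a warning; DOWNLOAD_MAXSIZE cancels the
    # download, which would drop the sitemap before its URLs are scheduled.
    custom_settings = {
        "DOWNLOAD_WARNSIZE": 64 * 1024 * 1024,    # warn at 64 MB (example value)
        "DOWNLOAD_MAXSIZE": 256 * 1024 * 1024,    # abort above 256 MB (example value)
    }

    def parse_product(self, response):
        yield {"name": response.css("h1::text").get()}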
Related: How to scrape an XML file with Scrapy
Related: How to use CrawlSpider in Scrapy
$ cd shopdemo
from scrapy.spiders import SitemapSpider


class SitemapSeedSpider(SitemapSpider):
    name = "sitemap_seed"
    allowed_domains = ["shop.example"]
    # Entry point: the sitemap (or sitemap index) the crawl is seeded from.
    sitemap_urls = [
        "https://shop.example/sitemap.xml",
    ]
    # If the entry file is a sitemap index, only child sitemaps matching
    # this pattern are opened.
    sitemap_follow = [
        r"/products\.xml$",
    ]
    # Page URLs matching the pattern are routed to the named callback.
    sitemap_rules = [
        (r"/products/", "parse_product"),
    ]

    def parse_product(self, response):
        name = response.css("h1::text").get()
        yield {"name": name}
Relative <loc> values are invalid in XML sitemaps, and hostnames outside allowed_domains are filtered before parse_product runs.
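One way to see which entries never reach parse_product is to override SitemapSpider's sitemap_filter hook and log each <loc> before rule matching. The exact-match host check below is a simplification added for illustration; it also flags subdomains that the offsite filter would accept.

from urllib.parse import urlparse

from scrapy.spiders import SitemapSpider


class AuditedSitemapSpider(SitemapSpider):
    name = "sitemap_audit"
    allowed_domains = ["shop.example"]
    sitemap_urls = ["https://shop.example/sitemap.xml"]
    sitemap_rules = [(r"/products/", "parse_product")]

    def sitemap_filter(self, entries):
        # Runs on every sitemap entry before sitemap_rules matching.
        for entry in entries:
            host = urlparse(entry["loc"]).netloc
            if host not in self.allowed_domains:
                # Exact-match check for illustration only.
                self.logger.info("Entry outside allowed_domains: %s", entry["loc"])
            yield entry

    def parse_product(self, response):
        yield {"name": response.css("h1::text").get()}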
$ scrapy crawl sitemap_seed -O pages.jsonl
##### snipped #####
Stored jsonl feed (2 items) in: pages.jsonl
Spider closed (finished)
-O overwrites the existing export file so repeated sitemap tests do not append stale rows.
$ cat pages.jsonl
{"name": "Starter Plan"}
{"name": "Team Plan"}
If the export is empty or includes the wrong section, re-check the sitemap URL, the sitemap_follow pattern for index files, and the hostnames listed under allowed_domains.
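A quick offline sanity check, sketched below with placeholder URLs, is to run the sitemap_follow pattern through re.search against the child sitemap locations listed in the index, which mirrors how SitemapSpider applies its follow patterns.

import re

# Placeholder child sitemap URLs copied from a sitemap index for testing.
sitemap_follow = r"/products\.xml$"
child_sitemaps = [
    "https://shop.example/products.xml",
    "https://shop.example/blog.xml",
]

for url in child_sitemaps:
    matched = bool(re.search(sitemap_follow, url))
    print(f"{url} -> {'followed' if matched else 'skipped'}")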