XML sitemaps publish the canonical URLs a site wants indexed. Seeding a Scrapy crawl from that list produces predictable coverage and avoids depending on fragile navigation links or a hand-written start_urls list.
Scrapy can fetch the sitemap XML and extract each loc entry inside a urlset or sitemapindex document. The built-in SitemapSpider schedules requests for matching URLs and maps URL patterns to callbacks using regular expressions.
Sitemaps are often split across multiple files, exposed via a sitemap index, and sometimes served as compressed .xml.gz resources. Ensure the sitemap URL and allowed_domains match the hostnames inside the sitemap (www vs apex domain), and expect a sitemap seed to generate a large initial request burst on big sites.
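For reference, a sitemap index is just an XML list of child sitemap locations; SitemapSpider fetches every loc it finds. A minimal illustration (the child file names here are assumptions, not from a real site):

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>http://app.internal.example:8000/sitemap-news.xml.gz</loc></sitemap>
  <sitemap><loc>http://app.internal.example:8000/sitemap-products.xml</loc></sitemap>
</sitemapindex>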
Related: How to scrape an XML file with Scrapy
Related: How to use CrawlSpider in Scrapy
$ vi simplifiedguide/spiders/sitemap_seed.py
from scrapy.spiders import SitemapSpider


class SitemapSeedSpider(SitemapSpider):
    name = "sitemap_seed"
    allowed_domains = ["app.internal.example"]
    sitemap_urls = ["http://app.internal.example:8000/sitemap.xml"]
    # Route sitemap URLs matching each regex to the named callback method.
    sitemap_rules = [
        (r"/news/", "parse_page"),
        (r"/products/", "parse_page"),
    ]

    def parse_page(self, response):
        yield {
            "title": response.css("h1::text").get(),
            "url": response.url,
        }
Match allowed_domains and sitemap_urls to the target site, and tighten the sitemap_rules patterns to constrain crawl scope.
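If the site exposes a sitemap index rather than a single urlset, sitemap_urls can point at the index itself: SitemapSpider follows the nested child sitemaps automatically, including .xml.gz ones, and the sitemap_follow attribute restricts which child sitemaps are fetched. A sketch under assumed file names (sitemap_index.xml with a sitemap-news child):

from scrapy.spiders import SitemapSpider


class SitemapIndexSpider(SitemapSpider):
    name = "sitemap_index"
    allowed_domains = ["app.internal.example"]
    # Seed from the index; nested child sitemaps are discovered automatically.
    sitemap_urls = ["http://app.internal.example:8000/sitemap_index.xml"]
    # Only follow child sitemaps whose URL matches this pattern (assumed name).
    sitemap_follow = [r"sitemap-news"]
    sitemap_rules = [(r"/news/", "parse_page")]

    def parse_page(self, response):
        yield {"title": response.css("h1::text").get(), "url": response.url}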
$ scrapy crawl sitemap_seed -O pages.json -s HTTPCACHE_ENABLED=False -s LOG_LEVEL=INFO
2026-01-01 09:17:17 [scrapy.extensions.feedexport] INFO: Stored json feed (2 items) in: pages.json
Large sitemaps can enqueue thousands of requests quickly; insufficient throttling can overload the origin server and trigger rate limits or IP blocks.
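One way to soften that burst is AutoThrottle plus conservative concurrency limits in the project's settings.py; the values below are illustrative starting points, not tuned recommendations:

# settings.py: throttle a sitemap-seeded crawl (illustrative values).
AUTOTHROTTLE_ENABLED = True            # adapt request delay to server latency
AUTOTHROTTLE_START_DELAY = 1.0         # initial download delay in seconds
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0  # average concurrent requests to aim for
CONCURRENT_REQUESTS_PER_DOMAIN = 4     # hard cap on parallel requests per host
DOWNLOAD_DELAY = 0.5                   # minimum delay between requests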
$ head -n 4 pages.json
[
{"title": "News Archive", "url": "http://app.internal.example:8000/news/"},
{"title": "Products", "url": "http://app.internal.example:8000/products/"}
]