XML sitemaps publish the canonical URLs a site wants indexed. Seeding a Scrapy crawl from that list produces predictable coverage and avoids depending on fragile navigation links or a hand-written start_urls list.
Scrapy can fetch the sitemap XML and extract each loc entry inside a urlset or sitemapindex document. The built-in SitemapSpider schedules requests for matching URLs and maps URL patterns to callbacks using regular expressions.
Sitemaps are often split across multiple files, exposed via a sitemap index, and sometimes served as compressed .xml.gz resources. Ensure the sitemap URL and allowed_domains match the hostnames inside the sitemap (www vs apex domain), and expect a sitemap seed to generate a large initial request burst on big sites.
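The extraction step that SitemapSpider performs can be approximated with the standard library. A minimal sketch, assuming an illustrative urlset document (real sitemaps are fetched over HTTP), that pulls the `<loc>` entries out of the sitemap namespace:

```python
import xml.etree.ElementTree as ET

# Illustrative urlset document; a real crawl downloads this from the site.
SITEMAP_XML = """\
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://app.internal.example:8000/news/</loc></url>
  <url><loc>http://app.internal.example:8000/products/</loc></url>
</urlset>"""

# Sitemap elements live in this namespace; unqualified tag names will not match.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def extract_locs(xml_text):
    """Return every <loc> URL in a urlset or sitemapindex document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iterfind(".//sm:loc", NS)]

print(extract_locs(SITEMAP_XML))
```

For a sitemapindex document, the same lookup returns the URLs of the child sitemaps rather than page URLs; SitemapSpider handles that recursion (and .xml.gz decompression) automatically.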
Related: How to scrape an XML file with Scrapy
Related: How to use CrawlSpider in Scrapy
Steps to seed Scrapy start URLs from an XML sitemap:
- Open the spider file for sitemap crawling.
$ vi simplifiedguide/spiders/sitemap_seed.py
- Define a SitemapSpider that schedules requests only for matching sitemap URLs.
from scrapy.spiders import SitemapSpider


class SitemapSeedSpider(SitemapSpider):
    name = "sitemap_seed"
    allowed_domains = ["app.internal.example"]
    sitemap_urls = ["http://app.internal.example:8000/sitemap.xml"]
    sitemap_rules = [
        (r"/news/", "parse_page"),
        (r"/products/", "parse_page"),
    ]

    def parse_page(self, response):
        yield {
            "title": response.css("h1::text").get(),
            "url": response.url,
        }
Match allowed_domains and sitemap_urls to the target site, and tighten the sitemap_rules patterns to constrain crawl scope.
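The rule patterns are ordinary regular expressions searched against each sitemap URL, and SitemapSpider uses the first rule that matches. A quick sketch for checking a pattern's scope before crawling (the URLs here are illustrative):

```python
import re

# Same rule patterns as the spider above; URLs are illustrative.
sitemap_rules = [
    (r"/news/", "parse_page"),
    (r"/products/", "parse_page"),
]

def match_rule(url):
    """Return the callback name for the first matching pattern, else None."""
    for pattern, callback in sitemap_rules:
        if re.search(pattern, url):
            return callback
    return None

print(match_rule("http://app.internal.example:8000/news/archive/"))  # parse_page
print(match_rule("http://app.internal.example:8000/about/"))         # None
```

URLs that match no rule are skipped entirely, so anchoring patterns (for example `r"/news/\d{4}/"`) is an easy way to exclude index pages from the crawl.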
- Run the spider with JSON export enabled.
$ scrapy crawl sitemap_seed -O pages.json -s HTTPCACHE_ENABLED=False -s LOG_LEVEL=INFO
2026-01-01 09:17:17 [scrapy.extensions.feedexport] INFO: Stored json feed (2 items) in: pages.json
Large sitemaps can enqueue thousands of requests quickly; insufficient throttling can overload the origin server and trigger rate limits or IP blocks.
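One way to keep that burst in check is to throttle in the spider itself via custom_settings. A sketch using Scrapy's AutoThrottle extension and per-domain limits; the values are illustrative and should be tuned per target site:

```python
# Illustrative throttle settings; add to the spider's custom_settings
# or to the project's settings.py.
custom_settings = {
    "AUTOTHROTTLE_ENABLED": True,          # adapt delay to server latency
    "AUTOTHROTTLE_START_DELAY": 1.0,       # initial delay in seconds
    "AUTOTHROTTLE_TARGET_CONCURRENCY": 2.0,
    "DOWNLOAD_DELAY": 0.5,                 # floor between requests
    "CONCURRENT_REQUESTS_PER_DOMAIN": 4,
}
```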
- Inspect the output to confirm URLs were captured.
$ head -n 4 pages.json
[
{"title": "News Archive", "url": "http://app.internal.example:8000/news/"},
{"title": "Products", "url": "http://app.internal.example:8000/products/"}
]
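Beyond eyeballing the file, the export can be checked programmatically against the rule patterns. A sketch with the pages.json contents inlined for illustration:

```python
import json
import re

# Contents of pages.json, inlined for illustration; in practice,
# read the real file produced by the crawl.
export = '''[
{"title": "News Archive", "url": "http://app.internal.example:8000/news/"},
{"title": "Products", "url": "http://app.internal.example:8000/products/"}
]'''

items = json.loads(export)

# Every exported URL should match at least one sitemap rule pattern.
patterns = [r"/news/", r"/products/"]
unmatched = [i["url"] for i in items
             if not any(re.search(p, i["url"]) for p in patterns)]

print(f"{len(items)} items exported, {len(unmatched)} outside rule scope")
```

A non-empty unmatched list usually means a rule pattern is looser than intended or a redirect moved a page outside the expected path.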
Mohd Shakir Zakaria is a cloud architect with deep roots in software development and open-source advocacy. Certified in AWS, Red Hat, VMware, ITIL, and Linux, he specializes in designing and managing robust cloud and on-premises infrastructures.
