An XML sitemap is often the cleanest crawl seed because it lists the canonical URLs a site already publishes. Starting a Scrapy crawl from that file avoids brittle navigation paths, reduces missed sections, and keeps the first request wave aligned with the site's own inventory.
Scrapy's SitemapSpider reads sitemap_urls, which may point at a sitemap file or at a robots.txt that declares one, follows sitemap index files automatically, and dispatches matching page URLs through sitemap_rules, which are evaluated in order with the first match winning. When the entry file is a sitemap index, the sitemap_follow patterns limit which child sitemaps are opened.
Each sitemap <loc> entry must be a full URL on a host permitted by allowed_domains, and broad sitemap indexes can schedule a large number of requests quickly. Compressed .xml.gz sitemap responses are still supported, but large sitemap bodies can still hit the crawler's download size limits before any page callback runs.
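As an illustration, a sitemap index on a hypothetical shop.example host might look like the fragment below; when sitemap_follow matches only the products child, the blog sitemap is never fetched:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://shop.example/products.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://shop.example/blog.xml</loc>
  </sitemap>
</sitemapindex>
```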
Related: How to scrape an XML file with Scrapy
Related: How to use CrawlSpider in Scrapy
Steps to seed Scrapy start URLs from an XML sitemap:
- Change to the root of the Scrapy project that will run the sitemap seed spider.
$ cd shopdemo
- Create the spider in shopdemo/spiders so the sitemap index routes product URLs to one callback.
- sitemap_seed.py
from scrapy.spiders import SitemapSpider


class SitemapSeedSpider(SitemapSpider):
    name = "sitemap_seed"
    allowed_domains = ["shop.example"]
    sitemap_urls = [
        "https://shop.example/sitemap.xml",
    ]
    sitemap_follow = [
        r"/products\.xml$",
    ]
    sitemap_rules = [
        (r"/products/", "parse_product"),
    ]

    def parse_product(self, response):
        name = response.css("h1::text").get()
        yield {"name": name}
Relative <loc> values are invalid in XML sitemaps, and hostnames outside allowed_domains are filtered before parse_product runs.
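The sitemap_follow and sitemap_rules patterns can be sanity-checked offline before a crawl; Scrapy matches them with an unanchored regex search. The URLs below are hypothetical stand-ins for entries the sitemap might list:

```python
import re

# Patterns copied from the spider above.
sitemap_follow = re.compile(r"/products\.xml$")
product_rule = re.compile(r"/products/")

# Hypothetical entries a sitemap index and a child sitemap might list.
child_sitemaps = [
    "https://shop.example/products.xml",
    "https://shop.example/blog.xml",
]
page_urls = [
    "https://shop.example/products/starter-plan",
    "https://shop.example/about",
]

# sitemap_follow filters which child sitemaps in an index are fetched.
followed = [u for u in child_sitemaps if sitemap_follow.search(u)]

# sitemap_rules routes matching page URLs to parse_product.
routed = [u for u in page_urls if product_rule.search(u)]

print(followed)  # only the products child sitemap
print(routed)    # only the product page
```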
- Run the spider and overwrite the current export file.
$ scrapy crawl sitemap_seed -O pages.jsonl
##### snipped #####
Stored jsonl feed (2 items) in: pages.jsonl
Spider closed (finished)
-O overwrites the existing export file so repeated sitemap tests do not append stale rows.
- Read the exported items and confirm that the crawl queued only the intended product pages.
$ cat pages.jsonl
{"name": "Starter Plan"}
{"name": "Team Plan"}

If the export is empty or includes the wrong section, re-check the sitemap URL, the sitemap_follow pattern for index files, and the hostnames listed under allowed_domains.
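A quick offline check can confirm each JSON Lines row parses and carries the field scraped in parse_product; this sketch uses inline sample rows matching the export above rather than reading pages.jsonl from disk:

```python
import json

# Sample rows matching the pages.jsonl export above; in practice,
# iterate over open("pages.jsonl") instead.
rows = [
    '{"name": "Starter Plan"}',
    '{"name": "Team Plan"}',
]

# Each line in a JSON Lines feed is one standalone JSON object.
items = [json.loads(line) for line in rows]

# Every item should carry the name yielded by parse_product.
names = [item["name"] for item in items]
print(names)
```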
Mohd Shakir Zakaria is a cloud architect with deep roots in software development and open-source advocacy. Certified in AWS, Red Hat, VMware, ITIL, and Linux, he specializes in designing and managing robust cloud and on-premises infrastructures.
