XML sitemaps publish the canonical URLs a site wants indexed. Seeding a Scrapy crawl from that list produces predictable coverage and avoids depending on fragile navigation links or a hand-written start_urls list.

Scrapy can fetch the sitemap XML and extract each loc entry inside a urlset or sitemapindex document. The built-in SitemapSpider schedules requests for matching URLs and maps URL patterns to callbacks using regular expressions.
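
For reference, a minimal urlset document has this shape (illustrative URLs; the namespace is the standard sitemaps.org schema):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://app.internal.example:8000/news/</loc>
      </url>
    </urlset>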

Sitemaps are often split across multiple files behind a sitemap index and are sometimes served as compressed .xml.gz resources; SitemapSpider follows index entries and handles gzipped sitemaps automatically. Ensure sitemap_urls and allowed_domains match the hostnames inside the sitemap (www versus apex domain), since URLs on other hosts are filtered as offsite requests, and expect a sitemap seed to generate a large initial request burst on big sites.
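
A sitemap index is just a list of pointers to child sitemaps, and SitemapSpider recurses into each loc entry it finds. A minimal illustrative index, with hypothetical file names:

    <?xml version="1.0" encoding="UTF-8"?>
    <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <sitemap>
        <loc>http://app.internal.example:8000/sitemap-news.xml.gz</loc>
      </sitemap>
      <sitemap>
        <loc>http://app.internal.example:8000/sitemap-products.xml.gz</loc>
      </sitemap>
    </sitemapindex>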

Steps to seed Scrapy start URLs from an XML sitemap:

  1. Open the spider file for sitemap crawling.
    $ vi simplifiedguide/spiders/sitemap_seed.py
  2. Define a SitemapSpider that schedules requests only for matching sitemap URLs.
    from scrapy.spiders import SitemapSpider


    class SitemapSeedSpider(SitemapSpider):
        name = "sitemap_seed"
        allowed_domains = ["app.internal.example"]
        # Seed the crawl from the sitemap; sitemap index files and gzipped
        # sitemaps are followed and decompressed automatically.
        sitemap_urls = ["http://app.internal.example:8000/sitemap.xml"]
        # Map URL regexes to callbacks; sitemap URLs matching no rule are skipped.
        sitemap_rules = [
            (r"/news/", "parse_page"),
            (r"/products/", "parse_page"),
        ]

        def parse_page(self, response):
            # Emit one item per matched page: its first <h1> text and final URL.
            yield {
                "title": response.css("h1::text").get(),
                "url": response.url,
            }

    Match allowed_domains and sitemap_urls to the target site, and tighten the sitemap_rules patterns to constrain crawl scope.
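
    When the site publishes a sitemap index, the sitemap_follow attribute narrows which child sitemaps are fetched at all. A sketch, added to the same spiders module and assuming the hypothetical sitemap-news file name from the index example above:

    class SitemapNewsOnlySpider(SitemapSeedSpider):
        name = "sitemap_news_only"
        # Follow only child sitemaps whose URL matches one of these regexes;
        # by default SitemapSpider follows every sitemap listed in the index.
        sitemap_follow = [r"sitemap-news"]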

  3. Run the spider with JSON export enabled.
    $ scrapy crawl sitemap_seed -O pages.json -s HTTPCACHE_ENABLED=False -s LOG_LEVEL=INFO
    2026-01-01 09:17:17 [scrapy.extensions.feedexport] INFO: Stored json feed (2 items) in: pages.json

    Large sitemaps can enqueue thousands of requests quickly; insufficient throttling can overload the origin server and trigger rate limits or IP blocks.
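
    One way to soften that burst is to enable Scrapy's AutoThrottle extension and cap per-domain concurrency for the run; a sketch with deliberately conservative values (tune them for the target site):
    $ scrapy crawl sitemap_seed -O pages.json -s AUTOTHROTTLE_ENABLED=True -s CONCURRENT_REQUESTS_PER_DOMAIN=4 -s DOWNLOAD_DELAY=0.5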

  4. Inspect the output to confirm URLs were captured.
    $ head -n 4 pages.json
    [
    {"title": "News Archive", "url": "http://app.internal.example:8000/news/"},
    {"title": "Products", "url": "http://app.internal.example:8000/products/"}
    ]
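
    To confirm the feed is well-formed JSON rather than eyeballing it, the standard library's json.tool module works with no extra dependencies:
    $ python -m json.tool pages.json > /dev/null && echo valid
    valid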