XML sitemaps publish the canonical URLs a site wants indexed. Seeding a Scrapy crawl from that list produces predictable coverage and avoids depending on fragile navigation links or a hand-written start_urls list.

Scrapy can fetch the sitemap XML and extract each loc entry inside a urlset or sitemapindex document. The built-in SitemapSpider schedules requests for matching URLs and maps URL patterns to callbacks using regular expressions.
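
For reference, a minimal urlset document has this shape (illustrative URLs; the namespace is the standard sitemaps.org schema):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://app.internal.example:8000/news/</loc>
      </url>
    </urlset>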

Sitemaps are often split across multiple files behind a sitemap index and are sometimes served as compressed .xml.gz resources; SitemapSpider follows index entries and handles gzipped sitemaps automatically. Ensure sitemap_urls and allowed_domains match the hostnames inside the sitemap (www versus apex domain), since URLs on other hosts are filtered as offsite requests, and expect a sitemap seed to generate a large initial request burst on big sites.
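
A sitemap index is just a list of pointers to child sitemaps, and SitemapSpider recurses into each loc entry it finds. A minimal illustrative index, with hypothetical file names:

    <?xml version="1.0" encoding="UTF-8"?>
    <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <sitemap>
        <loc>http://app.internal.example:8000/sitemap-news.xml.gz</loc>
      </sitemap>
      <sitemap>
        <loc>http://app.internal.example:8000/sitemap-products.xml.gz</loc>
      </sitemap>
    </sitemapindex>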

Steps to seed Scrapy start URLs from an XML sitemap:

  1. Open the spider file for sitemap crawling.
    $ vi simplifiedguide/spiders/sitemap_seed.py
  2. Define a SitemapSpider that schedules requests only for matching sitemap URLs.
    from scrapy.spiders import SitemapSpider


    class SitemapSeedSpider(SitemapSpider):
        name = "sitemap_seed"
        allowed_domains = ["app.internal.example"]
        # Seed the crawl from the sitemap; sitemap index files and gzipped
        # sitemaps are followed and decompressed automatically.
        sitemap_urls = ["http://app.internal.example:8000/sitemap.xml"]
        # Map URL regexes to callbacks; sitemap URLs matching no rule are skipped.
        sitemap_rules = [
            (r"/news/", "parse_page"),
            (r"/products/", "parse_page"),
        ]

        def parse_page(self, response):
            # Emit one item per matched page: its first <h1> text and final URL.
            yield {
                "title": response.css("h1::text").get(),
                "url": response.url,
            }

    Match allowed_domains and sitemap_urls to the target site, and tighten the sitemap_rules patterns to constrain crawl scope.
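
    When the site publishes a sitemap index, the sitemap_follow attribute narrows which child sitemaps are fetched at all. A sketch, added to the same spiders module and assuming the hypothetical sitemap-news file name from the index example above:

    class SitemapNewsOnlySpider(SitemapSeedSpider):
        name = "sitemap_news_only"
        # Follow only child sitemaps whose URL matches one of these regexes;
        # by default SitemapSpider follows every sitemap listed in the index.
        sitemap_follow = [r"sitemap-news"]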

  3. Run the spider with JSON export enabled.
    $ scrapy crawl sitemap_seed -O pages.json -s HTTPCACHE_ENABLED=False -s LOG_LEVEL=INFO
    2026-01-01 09:17:17 [scrapy.extensions.feedexport] INFO: Stored json feed (2 items) in: pages.json

    Large sitemaps can enqueue thousands of requests quickly; insufficient throttling can overload the origin server and trigger rate limits or IP blocks.
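
    One way to soften that burst is to enable Scrapy's AutoThrottle extension and cap per-domain concurrency for the run; a sketch with deliberately conservative values (tune them for the target site):
    $ scrapy crawl sitemap_seed -O pages.json -s AUTOTHROTTLE_ENABLED=True -s CONCURRENT_REQUESTS_PER_DOMAIN=4 -s DOWNLOAD_DELAY=0.5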

  4. Inspect the output to confirm URLs were captured.
    $ head -n 4 pages.json
    [
    {"title": "News Archive", "url": "http://app.internal.example:8000/news/"},
    {"title": "Products", "url": "http://app.internal.example:8000/products/"}
    ]
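
    To confirm the feed is well-formed JSON rather than eyeballing it, the standard library's json.tool module works with no extra dependencies:
    $ python -m json.tool pages.json > /dev/null && echo valid
    valid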