An XML sitemap is often the cleanest crawl seed because it lists the canonical URLs a site already wants discovered. Starting a Scrapy crawl from that source avoids brittle navigation paths, reduces missed sections, and keeps the first request set aligned with the site's own published inventory.

In current Scrapy releases, SitemapSpider still reads URLs from sitemap_urls, accepts a direct sitemap URL or a robots.txt URL, follows sitemap index files, and dispatches matching page URLs through ordered sitemap_rules. When a sitemap index points to many child files, sitemap_follow can keep the crawl on only the sitemap branches that belong to the current job.
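
As a concrete illustration, a hypothetical sitemap index for this site could look like the following; with it, a sitemap_follow pattern such as r"/products\.xml$" keeps the crawl on the product branch and skips the blog branch.

    <?xml version="1.0" encoding="UTF-8"?>
    <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <sitemap><loc>https://shop.example/products.xml</loc></sitemap>
      <sitemap><loc>https://shop.example/blog.xml</loc></sitemap>
    </sitemapindex>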

Sitemap <loc> entries must be full crawlable URLs on the same host, not relative paths, and allowed_domains should match the hostnames that actually appear inside the sitemap. Large sitemap seeds can enqueue many requests quickly, and because current Scrapy transparently handles compressed .xml.gz sitemap responses, even one small seed file can expand into a large request set, so keep patterns narrow and apply crawl throttling when the sitemap covers a large site.
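
Throttling can live in settings.py or in a spider's custom_settings; the sketch below uses illustrative values, not tuned recommendations.

    # settings.py -- illustrative throttle values, adjust per site
    AUTOTHROTTLE_ENABLED = True           # adapt delays to observed latency
    CONCURRENT_REQUESTS_PER_DOMAIN = 4    # cap parallel requests per host
    DOWNLOAD_DELAY = 0.5                  # baseline seconds between requests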

Steps to seed Scrapy start URLs from an XML sitemap:

  1. Change to the root of the Scrapy project that will run the sitemap seed spider.
    $ cd shopdemo
  2. Create a SitemapSpider that points to the sitemap entry URL and maps matching page URLs to one callback.
    shopdemo/spiders/sitemap_seed.py
    from scrapy.spiders import SitemapSpider
     
     
    class SitemapSeedSpider(SitemapSpider):
        name = "sitemap_seed"
        allowed_domains = ["shop.example"]
        sitemap_urls = ["https://shop.example/sitemap.xml"]
        sitemap_follow = [r"/products\\.xml$"]
        sitemap_rules = [
            (r"/products/", "parse_product"),
        ]
     
        def parse_product(self, response):
            yield {
                "name": response.css("h1::text").get(),
                "sku": response.css("p.sku::text").re_first(r"SKU:\\s*(.+)"),
                "url": response.url,
            }

    Use a robots.txt URL in sitemap_urls when the site publishes sitemap locations only there, and remove sitemap_follow when the seed file is already one urlset instead of a sitemap index.
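
    As a sketch, assuming the site advertises its sitemap only through robots.txt, the single change to the spider above is the seed URL:

    # SitemapSpider extracts Sitemap: lines from a robots.txt seed,
    # e.g. "Sitemap: https://shop.example/sitemap.xml"
    sitemap_urls = ["https://shop.example/robots.txt"]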

    The sitemap itself must publish complete same-host URLs in each <loc> entry: relative paths are not valid sitemap URLs, and host mismatches are filtered out by Scrapy's offsite middleware before the callback ever runs.
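
    A well-formed child sitemap consistent with the items exported in step 4 would publish absolute URLs like these (hypothetical file contents):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url><loc>https://shop.example/products/team/</loc></url>
      <url><loc>https://shop.example/products/starter/</loc></url>
    </urlset>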

  3. Run the spider and overwrite the export file for the current sitemap seed crawl.
    $ scrapy crawl sitemap_seed -O pages.jsonl -s LOG_LEVEL=INFO
    2026-04-16 06:41:37 [scrapy.utils.log] INFO: Scrapy 2.15.0 started (bot: shopdemo)
    ##### snipped #####
    2026-04-16 06:41:42 [scrapy.extensions.feedexport] INFO: Stored jsonl feed (2 items) in: pages.jsonl
    2026-04-16 06:41:42 [scrapy.core.engine] INFO: Spider closed (finished)

    -O overwrites the feed target on each run, which keeps repeated sitemap tests predictable while selectors and rules are still being tuned.
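
    Unlike -o, which appends to an existing file, -O replaces it. The same overwrite behavior can be pinned in the project with the FEEDS setting; this sketch mirrors the command-line flags above:

    # settings.py -- declarative equivalent of `-O pages.jsonl`
    FEEDS = {
        "pages.jsonl": {
            "format": "jsonlines",  # one JSON object per line
            "overwrite": True,      # replace the file each run, like -O
        },
    }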

    sitemap_rules are checked in order and only the first matching rule is applied, so place the narrowest URL patterns first when one page URL could match more than one callback, as in the sketch below.
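
    For instance, if sale pages needed a separate callback (a hypothetical parse_sale is assumed here), the narrower rule must come first or it would never fire:

    sitemap_rules = [
        (r"/products/sale/", "parse_sale"),  # narrower pattern first
        (r"/products/", "parse_product"),    # general fallback
    ]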

  4. Read the exported items to confirm that the sitemap scheduled only the intended product pages.
    $ cat pages.jsonl
    {"name": "Team Plan", "sku": "team-001", "url": "https://shop.example/products/team/"}
    {"name": "Starter Plan", "sku": "starter-001", "url": "https://shop.example/products/starter/"}

    If the export is empty or includes the wrong sections, re-check the sitemap URL, the sitemap_follow patterns for index files, and the hostnames listed under allowed_domains.
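
    One quick check, assuming curl is available on the crawling machine, is to list the <loc> entries the seed actually publishes before re-running the spider:

    $ curl -s https://shop.example/sitemap.xml | grep "<loc>"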