How to scrape an XML file with Scrapy

Scraping an XML file directly with Scrapy keeps the crawl on the structured source instead of brittle page markup, which is useful for catalog exports, partner feeds, and scheduled data drops. When the repeated record element stays stable, the extraction logic usually survives site redesigns better than HTML-only selectors.

Current Scrapy exposes a direct XML response as an XmlResponse in scrapy shell, so one quick shell check is enough to confirm the repeated node and the field paths before writing a spider. For the reusable crawl, XMLFeedSpider is the built-in spider class that iterates over matching XML elements and passes each one to parse_node() for normal item extraction.

The XML URL must return the raw file body rather than an HTML download page, login response, or redirect target. Files with default namespaces need either registered prefixes or a deliberate remove_namespaces() pass before plain XPath queries will match. Running scrapy runspider from inside an existing Scrapy project can also pull in that project's settings, middleware, and pipelines, so keep standalone XML tests in a neutral working directory when you want predictable results.
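The default-namespace pitfall can be reproduced without Scrapy at all. The sketch below uses the standard library's xml.etree.ElementTree on a made-up feed snippet (the xmlns URI and sku value are illustrative, not from the real feed) to show why an unprefixed path finds nothing until a prefix is mapped to the namespace:

```python
# Illustration only: a hypothetical feed body with a default namespace,
# parsed with the standard library to show why unprefixed paths miss.
import xml.etree.ElementTree as ET

XML = """<catalog xmlns="https://example.com/catalog">
  <product sku="starter-001"><name>Starter Plan</name></product>
</catalog>"""

root = ET.fromstring(XML)

# An unprefixed path matches nothing: every element carries the namespace.
print(root.findall("product"))  # []

# Mapping a prefix to the namespace URI makes the same query match.
ns = {"n": "https://example.com/catalog"}
print([p.get("sku") for p in root.findall("n:product", ns)])  # ['starter-001']
```

Scrapy's remove_namespaces() sidesteps this by stripping the namespace entirely, while registering a prefix keeps the document intact and queries it explicitly.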

Steps to scrape an XML file with Scrapy using XMLFeedSpider:

  1. Check the XML source in Scrapy shell and confirm that the repeated node path returns the records you expect.
    $ scrapy shell --nolog https://files.example.net/data/products.xml -c '(type(response).__name__, response.xpath("//product/name/text()").getall())'
    ('XmlResponse', ['Starter Plan', 'Team Plan', 'Growth Plan'])

    The response should be the XML body itself. If //product returns no matches on a namespaced file, test response.selector.remove_namespaces() in the shell first, or register a prefix and query the namespaced path explicitly.

  2. Save a standalone XMLFeedSpider file with the repeated tag and one parse_node() callback.
    $ vi product_xml_spider.py
    product_xml_spider.py
    from scrapy.spiders import XMLFeedSpider
     
     
    class ProductXmlSpider(XMLFeedSpider):
        name = "product_xml"
        start_urls = ["https://files.example.net/data/products.xml"]
        iterator = "iternodes"
        itertag = "product"
     
        def parse_node(self, response, node):
            yield {
                "sku": node.xpath("@sku").get(),
                "name": node.xpath("name/text()").get(),
                "price": node.xpath("price/text()").get(),
                "url": node.xpath("url/text()").get(),
            }

    XMLFeedSpider's default iternodes iterator streams repeated XML elements efficiently, and namespaced files can keep their prefixes with namespaces = [("n", "https://example.com/catalog")] plus itertag = "n:product"; if the prefixed tag fails to match under iternodes, switch to iterator = "xml".
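The element-by-element behavior that XMLFeedSpider provides can be approximated with the standard library's iterparse, which is a rough stand-in for what iternodes and parse_node() do together. The feed body below is a made-up sample, not the real products.xml:

```python
# Sketch of streaming extraction over repeated nodes, assuming a made-up
# feed body; XMLFeedSpider's iternodes + parse_node() do the equivalent
# over the downloaded response.
import io
import xml.etree.ElementTree as ET

FEED = b"""<catalog>
  <product sku="starter-001"><name>Starter Plan</name><price>29.00</price></product>
  <product sku="team-001"><name>Team Plan</name><price>79.00</price></product>
</catalog>"""

items = []
for event, elem in ET.iterparse(io.BytesIO(FEED), events=("end",)):
    if elem.tag == "product":
        # One dict per repeated node, mirroring the parse_node() yield above.
        items.append({
            "sku": elem.get("sku"),
            "name": elem.findtext("name"),
            "price": elem.findtext("price"),
        })
        elem.clear()  # drop the processed subtree so large feeds stay small in memory

print(items)
```

The clear() call is the detail that matters for big exports: each record is handled and released instead of the whole document being held as one tree.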

  3. Run the spider and overwrite the current export file.
    $ scrapy runspider product_xml_spider.py -O products.jsonl
    2026-04-22 05:50:18 [scrapy.utils.log] INFO: Scrapy 2.15.0 started (bot: scrapybot)
    ##### snipped #####
    2026-04-22 05:50:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://files.example.net/data/products.xml> (referer: None)
    2026-04-22 05:50:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://files.example.net/data/products.xml>
    {'sku': 'starter-001', 'name': 'Starter Plan', 'price': '29.00', 'url': 'https://shop.example.com/products/starter-plan.html'}
    2026-04-22 05:50:21 [scrapy.extensions.feedexport] INFO: Stored jsonl feed (3 items) in: products.jsonl
    2026-04-22 05:50:21 [scrapy.core.engine] INFO: Spider closed (finished)

    The .jsonl suffix selects JSON Lines export automatically, and -O replaces any existing products.jsonl file.

    A standalone spider started from inside a Scrapy project can inherit that project's settings, middleware, and pipelines instead of running with neutral defaults.

  4. Read the exported rows and confirm that each XML record became one Scrapy item.
    $ cat products.jsonl
    {"sku": "starter-001", "name": "Starter Plan", "price": "29.00", "url": "https://shop.example.com/products/starter-plan.html"}
    {"sku": "team-001", "name": "Team Plan", "price": "79.00", "url": "https://shop.example.com/products/team-plan.html"}
    {"sku": "growth-001", "name": "Growth Plan", "price": "129.00", "url": "https://shop.example.com/products/growth-plan.html"}

    Each line should contain one parsed item, which makes .jsonl easy to inspect, diff, or stream into later processing.
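That one-record-per-line shape is what makes downstream processing trivial: the export can be consumed with nothing but the json module. The sketch below parses sample rows mirroring the output above (held in an in-memory buffer rather than the real products.jsonl) and totals the prices:

```python
# Hedged sketch: read a JSON Lines export one record per line with the
# standard json module; the rows mirror the export shown above.
import io
import json

jsonl = io.StringIO(
    '{"sku": "starter-001", "price": "29.00"}\n'
    '{"sku": "team-001", "price": "79.00"}\n'
    '{"sku": "growth-001", "price": "129.00"}\n'
)

items = [json.loads(line) for line in jsonl]
total = sum(float(item["price"]) for item in items)
print(len(items), total)  # 3 237.0
```

Swapping io.StringIO for open("products.jsonl") applies the same loop to the real export file.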