Scraping a direct XML file keeps the crawl on the structured source instead of on brittle page markup, which is useful for catalog exports, partner feeds, and scheduled data drops. When the repeating record element stays stable, extraction logic usually survives site redesigns better than HTML-only selectors.
Scrapy exposes an XML source as an XmlResponse in scrapy shell, so one quick shell check is enough to confirm the repeating node and the field paths. For the reusable crawl, XMLFeedSpider is the XML-specific spider that iterates over the matching elements and hands each one to parse_node() for item extraction.
The URL must return the raw XML body itself rather than an HTML landing page or login response, and files with default namespaces need explicit prefix handling before plain XPath queries will match. Running scrapy runspider from inside an existing project can also pull in that project's settings, middleware, and pipelines, so keep standalone XML tests in a neutral working directory when you need predictable output.
Related: How to scrape an RSS feed with Scrapy
Related: How to scrape a JSON API with Scrapy
Steps to scrape an XML file with Scrapy using XMLFeedSpider:
- Check the XML source in Scrapy shell and confirm that the repeating node path returns the expected records.
$ scrapy shell "file:///srv/feeds/products.xml" --nolog -c '(type(response).__name__, response.xpath("//product/name/text()").getall())'
('XmlResponse', ['Starter Plan', 'Team Plan', 'Growth Plan'])

Replace the file:// URI with an https:// URL when the XML file is hosted remotely instead of on local storage.
If the XML declares a default namespace and //product returns no matches, try response.selector.remove_namespaces() in the shell for quick exploration before you lock in the spider selectors.
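As a quick illustration of that shell workflow, the snippet below shows both the namespace-stripping shortcut and the explicit-prefix alternative; the catalog URI is a hypothetical placeholder, not something taken from the feed above.

# Inside scrapy shell -- pick one approach per session, since
# remove_namespaces() rewrites the parsed tree in place.

# Approach 1: strip every namespace for quick exploration.
response.selector.remove_namespaces()
print(response.xpath("//product/name/text()").getall())

# Approach 2 (fresh shell): register a prefix and query with it.
# The URI below is a hypothetical placeholder for the feed's default namespace.
response.selector.register_namespace("n", "https://example.com/catalog")
print(response.xpath("//n:product/n:name/text()").getall())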
- Create a standalone XMLFeedSpider file with the record tag and one parse_node() callback.
$ vi product_xml_spider.py
- product_xml_spider.py
from scrapy.spiders import XMLFeedSpider


class ProductXmlSpider(XMLFeedSpider):
    name = "product_xml"
    start_urls = ["file:///srv/feeds/products.xml"]
    iterator = "iternodes"
    itertag = "product"

    def parse_node(self, response, node):
        yield {
            "sku": node.xpath("@sku").get(),
            "name": node.xpath("name/text()").get(),
            "price": node.xpath("price/text()").get(),
            "url": node.xpath("url/text()").get(),
        }
Add namespaces = [("n", "https://example.com/catalog")] and change itertag = "n:product" when the file uses an XML namespace.
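A minimal sketch of that namespaced variant follows, assuming a hypothetical default namespace of https://example.com/catalog; it also switches to the "xml" iterator, because the default iternodes iterator matches tag names textually and is generally not reliable with namespaced feeds.

from scrapy.spiders import XMLFeedSpider


class ProductXmlNsSpider(XMLFeedSpider):
    name = "product_xml_ns"
    start_urls = ["file:///srv/feeds/products.xml"]
    # The "xml" iterator resolves namespaces through Selector;
    # "iternodes" matches raw tag names and may miss namespaced records.
    iterator = "xml"
    # Hypothetical URI -- replace with the namespace the feed declares.
    namespaces = [("n", "https://example.com/catalog")]
    itertag = "n:product"

    def parse_node(self, response, node):
        # Child elements live in the same namespace, so queries need
        # the registered prefix; attributes stay unprefixed.
        yield {
            "sku": node.xpath("@sku").get(),
            "name": node.xpath("n:name/text()").get(),
            "price": node.xpath("n:price/text()").get(),
            "url": node.xpath("n:url/text()").get(),
        }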
Related: How to create a Scrapy spider
- Run the spider and overwrite the export file for the current crawl.
$ scrapy runspider product_xml_spider.py -O products.jsonl
2026-04-16 05:59:39 [scrapy.utils.log] INFO: Scrapy 2.15.0 started (bot: scrapybot)
##### snipped #####
2026-04-16 05:59:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET file:///srv/feeds/products.xml> (referer: None)
2026-04-16 05:59:40 [scrapy.extensions.feedexport] INFO: Stored jsonl feed (3 items) in: products.jsonl
2026-04-16 05:59:40 [scrapy.core.engine] INFO: Spider closed (finished)
The -O flag overwrites the target output file on each run, unlike the lowercase -o flag, which appends to an existing file; overwriting keeps each XML scrape run self-contained.
A standalone spider started from inside a Scrapy project can inherit that project's settings, middleware, and pipelines instead of running with neutral defaults.
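When predictable output matters, one alternative is to launch the spider from a short Python script with an explicit settings dict, so nothing is inherited from a surrounding project. This is a sketch only; it assumes Scrapy's FEEDS setting with its overwrite option (Scrapy 2.4 or later) and the spider file from the previous step on the import path.

from scrapy.crawler import CrawlerProcess

from product_xml_spider import ProductXmlSpider

# Passing settings explicitly keeps the run independent of any
# scrapy.cfg discovered in parent directories.
process = CrawlerProcess(
    settings={
        "FEEDS": {"products.jsonl": {"format": "jsonlines", "overwrite": True}},
        "LOG_LEVEL": "INFO",
    }
)
process.crawl(ProductXmlSpider)
process.start()  # blocks until the crawl finishes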
- Read the exported items to confirm that each XML record became one Scrapy item.
$ cat products.jsonl
{"sku": "starter-001", "name": "Starter Plan", "price": "29.00", "url": "https://shop.example.com/products/starter-plan.html"}
{"sku": "team-001", "name": "Team Plan", "price": "79.00", "url": "https://shop.example.com/products/team-plan.html"}
{"sku": "growth-001", "name": "Growth Plan", "price": "129.00", "url": "https://shop.example.com/products/growth-plan.html"}

Use .jsonl when you want one JSON object per line that is easy to inspect, diff, or stream into later processing.
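As a sketch of that downstream step, a few lines of standard-library Python are enough to stream the export back in, one record per line; the field names match the spider above.

import json

# Stream the export one record at a time instead of loading
# the whole file into memory.
with open("products.jsonl", encoding="utf-8") as feed:
    for line in feed:
        item = json.loads(line)
        print(item["sku"], item["price"])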
