Scraping XML endpoints provides structured data from feeds and APIs without relying on fragile page layouts, making automation and monitoring far more reliable than HTML-only scraping.
Scrapy downloads the XML document as a normal HTTP response and exposes it through a selector tree, so XPath can target elements, attributes, and text nodes to produce clean items that export directly to JSON or CSV.
Some XML sources include namespaces, CDATA blocks, embedded HTML, or very large documents, all of which can complicate selection and increase memory use. Validating selectors in the interactive shell first and keeping item extraction minimal helps avoid slow crawls and incomplete exports.
Related: How to scrape an RSS feed with Scrapy
Related: How to scrape a JSON API with Scrapy
Example: http://files.example.net:8000/data/products.xml
RSS feeds, sitemaps, and some APIs expose structured data as XML.
$ scrapy shell "http://files.example.net:8000/data/products.xml"
2026-01-01 09:05:19 [scrapy.utils.log] INFO: Scrapy 2.11.1 started (bot: simplifiedguide)
##### snipped #####
>>> response
<200 http://files.example.net:8000/data/products.xml>
>>> response.xpath('//product[1]').get()
'<product>\n<name>Starter Plan</name>\n<price>$29</price>\n<url>http://app.internal.example:8000/products/starter-plan.html</url>\n</product>'
Use local-name() when the document declares a default namespace.
>>> product_nodes = response.xpath('//product')
>>> len(product_nodes)
3
>>> product_nodes[0].xpath('name/text()').get()
'Starter Plan'
>>> product_nodes[0].xpath('price/text()').get()
'$29'
>>> product_nodes[0].xpath('url/text()').get()
'http://app.internal.example:8000/products/starter-plan.html'
>>> for n in product_nodes[:3]:
... print({
... 'name': n.xpath('name/text()').get(),
... 'price': n.xpath('price/text()').get(),
... 'url': n.xpath('url/text()').get(),
... })
...
{'name': 'Starter Plan', 'price': '$29', 'url': 'http://app.internal.example:8000/products/starter-plan.html'}
{'name': 'Team Plan', 'price': '$79', 'url': 'http://app.internal.example:8000/products/team-plan.html'}
{'name': 'Enterprise Plan', 'price': '$199', 'url': 'http://app.internal.example:8000/products/enterprise-plan.html'}
import scrapy


class ScrapeXmlSpider(scrapy.Spider):
    name = 'scrape-xml'
    start_urls = [
        'http://files.example.net:8000/data/products.xml',
    ]

    def parse(self, response):
        for node in response.xpath('//product'):
            yield {
                'name': node.xpath('name/text()').get(),
                'price': node.xpath('price/text()').get(),
                'url': node.xpath('url/text()').get(),
            }
Related: How to create a Scrapy spider
$ scrapy runspider --nolog -O products.json scrape_xml.py
Aggressive crawling or repeatedly downloading large XML files can trigger rate limiting or temporary blocks.
$ head -n 8 products.json
[
{"name": "Starter Plan", "price": "$29", "url": "http://app.internal.example:8000/products/starter-plan.html"},
{"name": "Team Plan", "price": "$79", "url": "http://app.internal.example:8000/products/team-plan.html"},
{"name": "Enterprise Plan", "price": "$199", "url": "http://app.internal.example:8000/products/enterprise-plan.html"}
]