Scraping a direct XML file keeps the crawl on the structured source instead of on brittle page markup, which is useful for catalog exports, partner feeds, and scheduled data drops. When the repeating record element stays stable, extraction logic usually survives site redesigns better than HTML-only selectors.
Scrapy exposes an XML source as an XmlResponse in scrapy shell, so one quick shell check is enough to confirm the repeating node and the field paths. For the reusable crawl, XMLFeedSpider is the XML-specific spider that iterates over the matching elements and hands each one to parse_node() for item extraction.
The URL must return the raw XML body itself rather than an HTML landing page or login response, and files with default namespaces need explicit prefix handling before plain XPath queries will match. Running scrapy runspider from inside an existing project can also pull in that project's settings, middleware, and pipelines, so keep standalone XML tests in a neutral working directory when you need predictable output.
Related: How to scrape an RSS feed with Scrapy
Related: How to scrape a JSON API with Scrapy
Steps to scrape an XML file with Scrapy using XMLFeedSpider:
- Check the XML source in Scrapy shell and confirm that the repeating node path returns the expected records.
$ scrapy shell "file:///srv/feeds/products.xml" --nolog -c '(type(response).__name__, response.xpath("//product/name/text()").getall())'
('XmlResponse', ['Starter Plan', 'Team Plan', 'Growth Plan'])

Replace the file:// URI with an https:// URL when the XML file is hosted remotely instead of on local storage.
If the XML declares a default namespace and //product returns no matches, try response.selector.remove_namespaces() in the shell for quick exploration before you lock in the spider selectors.
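As a quick illustration of that shell workflow, the snippet below shows both the namespace-stripping shortcut and the explicit-prefix alternative; the catalog URI is a hypothetical placeholder, not something taken from the feed above.

# Inside scrapy shell -- pick one approach per session, since
# remove_namespaces() rewrites the parsed tree in place.

# Approach 1: strip every namespace for quick exploration.
response.selector.remove_namespaces()
print(response.xpath("//product/name/text()").getall())

# Approach 2 (fresh shell): register a prefix and query with it.
# The URI below is a hypothetical placeholder for the feed's default namespace.
response.selector.register_namespace("n", "https://example.com/catalog")
print(response.xpath("//n:product/n:name/text()").getall())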
- Create a standalone XMLFeedSpider file with the record tag and one parse_node() callback.
$ vi product_xml_spider.py
- product_xml_spider.py
from scrapy.spiders import XMLFeedSpider


class ProductXmlSpider(XMLFeedSpider):
    name = "product_xml"
    start_urls = ["file:///srv/feeds/products.xml"]
    iterator = "iternodes"
    itertag = "product"

    def parse_node(self, response, node):
        yield {
            "sku": node.xpath("@sku").get(),
            "name": node.xpath("name/text()").get(),
            "price": node.xpath("price/text()").get(),
            "url": node.xpath("url/text()").get(),
        }
Add namespaces = [("n", "https://example.com/catalog")] and change itertag = "n:product" when the file uses an XML namespace.
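A minimal sketch of that namespaced variant follows, assuming a hypothetical default namespace of https://example.com/catalog; it also switches to the "xml" iterator, because the default iternodes iterator matches tag names textually and is generally not reliable with namespaced feeds.

from scrapy.spiders import XMLFeedSpider


class ProductXmlNsSpider(XMLFeedSpider):
    name = "product_xml_ns"
    start_urls = ["file:///srv/feeds/products.xml"]
    # The "xml" iterator resolves namespaces through Selector;
    # "iternodes" matches raw tag names and may miss namespaced records.
    iterator = "xml"
    # Hypothetical URI -- replace with the namespace the feed declares.
    namespaces = [("n", "https://example.com/catalog")]
    itertag = "n:product"

    def parse_node(self, response, node):
        # Child elements live in the same namespace, so queries need
        # the registered prefix; attributes stay unprefixed.
        yield {
            "sku": node.xpath("@sku").get(),
            "name": node.xpath("n:name/text()").get(),
            "price": node.xpath("n:price/text()").get(),
            "url": node.xpath("n:url/text()").get(),
        }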
Related: How to create a Scrapy spider
- Run the spider and overwrite the export file for the current crawl.
$ scrapy runspider product_xml_spider.py -O products.jsonl
2026-04-16 05:59:39 [scrapy.utils.log] INFO: Scrapy 2.15.0 started (bot: scrapybot)
##### snipped #####
2026-04-16 05:59:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET file:///srv/feeds/products.xml> (referer: None)
2026-04-16 05:59:40 [scrapy.extensions.feedexport] INFO: Stored jsonl feed (3 items) in: products.jsonl
2026-04-16 05:59:40 [scrapy.core.engine] INFO: Spider closed (finished)
The -O flag overwrites the target output file on each run, unlike the lowercase -o flag, which appends to an existing file; overwriting keeps each XML scrape run self-contained.
A standalone spider started from inside a Scrapy project can inherit that project's settings, middleware, and pipelines instead of running with neutral defaults.
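When predictable output matters, one alternative is to launch the spider from a short Python script with an explicit settings dict, so nothing is inherited from a surrounding project. This is a sketch only; it assumes Scrapy's FEEDS setting with its overwrite option (Scrapy 2.4 or later) and the spider file from the previous step on the import path.

from scrapy.crawler import CrawlerProcess

from product_xml_spider import ProductXmlSpider

# Passing settings explicitly keeps the run independent of any
# scrapy.cfg discovered in parent directories.
process = CrawlerProcess(
    settings={
        "FEEDS": {"products.jsonl": {"format": "jsonlines", "overwrite": True}},
        "LOG_LEVEL": "INFO",
    }
)
process.crawl(ProductXmlSpider)
process.start()  # blocks until the crawl finishes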
- Read the exported items to confirm that each XML record became one Scrapy item.
$ cat products.jsonl
{"sku": "starter-001", "name": "Starter Plan", "price": "29.00", "url": "https://shop.example.com/products/starter-plan.html"}
{"sku": "team-001", "name": "Team Plan", "price": "79.00", "url": "https://shop.example.com/products/team-plan.html"}
{"sku": "growth-001", "name": "Growth Plan", "price": "129.00", "url": "https://shop.example.com/products/growth-plan.html"}

Use .jsonl when you want one JSON object per line that is easy to inspect, diff, or stream into later processing.
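As a sketch of that downstream step, a few lines of standard-library Python are enough to stream the export back in, one record per line; the field names match the spider above.

import json

# Stream the export one record at a time instead of loading
# the whole file into memory.
with open("products.jsonl", encoding="utf-8") as feed:
    for line in feed:
        item = json.loads(line)
        print(item["sku"], item["price"])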
