Scraping an RSS feed with Scrapy is a reliable way to collect new posts, release notes, or monitoring events without reverse-engineering a full site layout. Feed documents usually change more slowly than HTML templates, so a feed-first workflow is often the lowest-maintenance path when the publisher exposes one.

Scrapy treats an RSS 2.0 feed as an XML response, so the same shell, selector, and export workflow used for HTML pages still applies. For recurring jobs, XMLFeedSpider is the Scrapy spider class built for XML feeds: it iterates the repeated item nodes efficiently and yields normal item dictionaries for export.

RSS feeds often expose only titles, links, dates, and summaries in the main entry fields, while richer bodies may live in namespaced elements such as content:encoded. Atom feeds use entry and updated where RSS 2.0 uses item and pubDate. Validate the repeated entry node in the shell before writing the spider, remove namespaces only when doing so simplifies the XPath you need, and keep polling intervals reasonable so monitoring the feed does not generate unnecessary crawl traffic.

Steps to scrape an RSS feed with Scrapy:

  1. Locate the feed URL from the site homepage, page source, or visible RSS icon.
  2. Open the feed URL and identify the repeated entry node before building selectors.

    RSS 2.0 feeds usually repeat channel/item, while Atom feeds usually repeat feed/entry.

  3. Launch the Scrapy shell with the feed URL.
    $ scrapy shell https://updates.example.net/rss.xml
    [s] Available Scrapy objects:
    [s]   request    <GET https://updates.example.net/rss.xml>
    [s]   response   <200 https://updates.example.net/rss.xml>
    ##### snipped #####
    >>>
  4. Select the repeated item nodes from the feed response.
    >>> posts = response.xpath("//channel/item")
    >>> len(posts)
    2

    If the selector returns 0, the feed may be Atom instead of RSS 2.0, or the URL may have loaded an HTML landing page instead of the raw feed.

  5. Read the fields needed from one entry before writing the spider.
    >>> posts[0].xpath("title/text()").get()
    'Launch Update'
    >>> posts[0].xpath("link/text()").get()
    'https://updates.example.net/news/launch-update.html'
    >>> posts[0].xpath("pubDate/text()").get()
    'Wed, 16 Apr 2026 09:00:00 +0000'
  6. Remove XML namespaces only when a richer field is stored under a prefix such as content:encoded.
    >>> response.selector.remove_namespaces()
    >>> response.xpath("//item/encoded/text()").get()
    '<p>Full launch update body.</p>'

    After remove_namespaces(), the content:encoded element can be selected as encoded.

  7. Save a spider that iterates the feed entries and extracts the fields to export.
    rss_feed_spider.py
    from scrapy.spiders import XMLFeedSpider
     
    class RssFeedSpider(XMLFeedSpider):
        name = "rss-feed"
        allowed_domains = ["updates.example.net"]
        start_urls = ["https://updates.example.net/rss.xml"]
        itertag = "item"
        namespaces = [("content", "http://purl.org/rss/1.0/modules/content/")]
     
        def parse_node(self, response, node):
            yield {
                "title": node.xpath("title/text()").get(),
                "link": node.xpath("link/text()").get(),
                "pubDate": node.xpath("pubDate/text()").get(),
                "body_html": node.xpath("content:encoded/text()").get(),
            }
  8. Run the spider and export the parsed feed items.
    $ scrapy runspider --nolog -O -:json rss_feed_spider.py
    [
    {"title": "Launch Update", "link": "https://updates.example.net/news/launch-update.html", "pubDate": "Wed, 16 Apr 2026 09:00:00 +0000", "body_html": "<p>Full launch update body.</p>"},
    {"title": "Usage Tips", "link": "https://updates.example.net/news/usage-tips.html", "pubDate": "Tue, 15 Apr 2026 09:00:00 +0000", "body_html": "<p>Full usage tips body.</p>"}
    ]

    Replace -O -:json with -O rss-items.json when the items should be written to a file instead of stdout.
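For a recurring monitoring job, the exported items can be deduplicated between polls and their RFC 822 pubDate strings normalized with the standard library. A sketch under the assumption that the items list mirrors the export above (in a real job it would be loaded from rss-items.json or handled in a pipeline):

```python
from email.utils import parsedate_to_datetime

# Items as exported above; assumed in-memory here for illustration.
items = [
    {"title": "Launch Update",
     "link": "https://updates.example.net/news/launch-update.html",
     "pubDate": "Wed, 16 Apr 2026 09:00:00 +0000"},
    {"title": "Usage Tips",
     "link": "https://updates.example.net/news/usage-tips.html",
     "pubDate": "Tue, 15 Apr 2026 09:00:00 +0000"},
]

seen = set()  # links already processed in earlier polls
fresh = []
for item in items:
    if item["link"] in seen:
        continue  # skip entries that reappear between polls
    seen.add(item["link"])
    # RSS 2.0 pubDate uses the RFC 822 date format, which the
    # standard library parses into a timezone-aware datetime.
    item["published"] = parsedate_to_datetime(item["pubDate"]).isoformat()
    fresh.append(item)

fresh.sort(key=lambda i: i["published"], reverse=True)  # newest first
print(fresh[0]["published"])  # 2026-04-16T09:00:00+00:00
```

Keeping the seen set in a small state file between runs turns the spider into a low-noise monitor that only surfaces genuinely new entries.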