Scraping a site’s RSS feed provides a low-friction way to monitor updates, build content indexes, or trigger automations when new posts appear. Compared to scraping full HTML pages, feeds usually keep a stable structure even when the site layout changes.
Most RSS and Atom feeds are delivered as standardized XML documents, where each entry exposes predictable fields such as title, link, and pubDate (or updated in Atom). Scrapy fetches the feed over HTTP and exposes it as a selector-enabled response, so extraction uses the same XPath workflow used for normal pages.
Feeds commonly embed HTML inside elements like description, omit full bodies, or place richer content in namespaced fields such as content:encoded. Polling too frequently can trigger rate limits or blocks, so prefer reasonable intervals and caching for production monitors, and validate permitted reuse before republishing feed content.
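Polling behavior can be tuned with Scrapy's built-in HTTP cache and AutoThrottle settings; the fragment below is a minimal sketch (the interval and delay values are arbitrary and should be adjusted per feed):

```python
# settings.py fragment: cache feed responses and throttle requests.

# Re-serve a cached copy of the feed instead of re-fetching on every poll.
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 900  # treat cached responses as fresh for 15 minutes

# Back off automatically based on server latency, with a base delay.
AUTOTHROTTLE_ENABLED = True
DOWNLOAD_DELAY = 5
```

These are standard Scrapy settings and can also be set per spider via the `custom_settings` class attribute.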
Related: How to scrape an XML file with Scrapy
Related: How to export Scrapy items to JSON

RSS 2.0 entries are typically under channel→item, while Atom entries are typically under feed→entry.
$ scrapy shell http://app.internal.example:8000/rss.xml
2026-01-01 09:15:37 [scrapy.utils.log] INFO: Scrapy 2.11.1 started (bot: simplifiedguide)
##### snipped #####
>>> response
<200 http://app.internal.example:8000/rss.xml>
A 301 or 302 status indicates a redirect, while 403 or 429 commonly indicates blocking or rate limiting.
>>> posts = response.xpath('//channel/item')
>>> len(posts)
2
>>> posts[0].xpath('title/text()').get()
'Launch Update'
>>> posts[-1].xpath('title/text()').get()
'Usage Tips'
>>> for item in response.xpath('//channel/item'):
...     post = {
...         'title': item.xpath('title/text()').get(),
...         'link': item.xpath('link/text()').get(),
...         'pubDate': item.xpath('pubDate/text()').get(),
...     }
...     print(post)
...
{'title': 'Launch Update', 'link': 'http://app.internal.example:8000/news/launch-update.html', 'pubDate': '2026-01-01 09:00:00 +0000'}
{'title': 'Usage Tips', 'link': 'http://app.internal.example:8000/news/usage-tips.html', 'pubDate': '2025-12-15 09:00:00 +0000'}
Full article bodies are commonly stored in description or content:encoded, which can include HTML and may require cleanup before reuse.
import scrapy


class ScrapeRssSpider(scrapy.Spider):
    name = "scrape-rss"
    start_urls = ["http://app.internal.example:8000/rss.xml"]

    def parse(self, response):
        for post in response.xpath("//channel/item"):
            yield {
                "title": post.xpath("title/text()").get(),
                "link": post.xpath("link/text()").get(),
                "pubDate": post.xpath("pubDate/text()").get(),
            }
Related: How to create a Scrapy spider
$ scrapy runspider --nolog --output -:json scrape_rss.py
[
{"title": "Launch Update", "link": "http://app.internal.example:8000/news/launch-update.html", "pubDate": "2026-01-01 09:00:00 +0000"},
{"title": "Usage Tips", "link": "http://app.internal.example:8000/news/usage-tips.html", "pubDate": "2025-12-15 09:00:00 +0000"}
]
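The pubDate strings scraped above can be converted to timezone-aware datetimes for sorting or new-post detection. This sketch assumes the `YYYY-MM-DD` format shown in the example feed; standard RSS 2.0 feeds use RFC 822 dates instead, which the stdlib parses directly:

```python
from datetime import datetime
from email.utils import parsedate_to_datetime

# Format as it appears in this example feed (adjust per feed).
dt = datetime.strptime("2026-01-01 09:00:00 +0000", "%Y-%m-%d %H:%M:%S %z")

# Standard RSS 2.0 pubDate values are RFC 822 dates:
dt2 = parsedate_to_datetime("Thu, 01 Jan 2026 09:00:00 +0000")

assert dt == dt2  # both name the same instant
```

With timezone-aware datetimes, entries can be compared against the timestamp of the last seen post to trigger automations only on new items.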