Scraping a site’s RSS feed provides a low-friction way to monitor updates, build content indexes, or trigger automations when new posts appear. Compared to scraping full HTML pages, feeds usually keep a stable structure even when the site layout changes.

Most RSS and Atom feeds are delivered as standardized XML documents, where each entry exposes predictable fields such as title, link, and pubDate (or updated in Atom). Scrapy fetches the feed over HTTP and exposes it as a selector-enabled response, so extraction uses the same XPath workflow used for normal pages.

Feeds commonly embed HTML inside elements like description, omit full bodies, or place richer content in namespaced fields such as content:encoded. Polling too frequently can trigger rate limits or blocks, so use reasonable intervals and caching for production monitors, and confirm that the feed's terms permit reuse before republishing its content.
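
For recurring polls, Scrapy's built-in HTTP cache keeps a local copy of the feed between runs so an unchanged feed is not refetched; a minimal settings sketch (the expiry and delay values are arbitrary starting points, not recommendations):

    # settings.py (or custom_settings on the spider)
    HTTPCACHE_ENABLED = True            # persist responses on disk between runs
    HTTPCACHE_EXPIRATION_SECS = 900     # serve cached copies for up to 15 minutes
    DOWNLOAD_DELAY = 5                  # seconds to wait between requests to one site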

Steps to scrape an RSS feed with Scrapy:

  1. Locate the RSS feed URL from the site homepage or its RSS icon.
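
    If no RSS icon or link is visible, many sites advertise the feed in the page head as <link rel="alternate" type="application/rss+xml" href="...">, which the Scrapy shell can extract from the homepage (the query and the returned path below are illustrative):

    $ scrapy shell http://app.internal.example:8000/
    >>> response.xpath('//link[contains(@type, "rss")]/@href').get()
    '/rss.xml'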
  2. Open the feed URL to identify the entry element path.

    RSS 2.0 nests entries under channel→item, while Atom nests them under feed→entry.
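
    Atom feeds declare a default XML namespace (http://www.w3.org/2005/Atom), so a bare //feed/entry XPath matches nothing until the namespace is handled. In the Scrapy shell, either strip namespaces or register a prefix; both calls are part of Scrapy's selector API:

    >>> response.selector.remove_namespaces()
    >>> entries = response.xpath('//feed/entry')

    or, keeping namespaces intact:

    >>> response.selector.register_namespace('atom', 'http://www.w3.org/2005/Atom')
    >>> entries = response.xpath('//atom:entry')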

  3. Launch the Scrapy shell with the feed URL.
    $ scrapy shell http://app.internal.example:8000/rss.xml
    2026-01-01 09:15:37 [scrapy.utils.log] INFO: Scrapy 2.11.1 started (bot: simplifiedguide)
    ##### snipped #####
  4. Confirm the response status is 200.
    >>> response
    <200 http://app.internal.example:8000/rss.xml>

    301 or 302 indicates a redirect, while 403 or 429 commonly indicates blocking or rate limiting.
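
    A 403 on a feed sometimes clears after retrying with a browser-like User-Agent, which the shell accepts as a per-session setting (the header value here is only an example):

    $ scrapy shell -s USER_AGENT="Mozilla/5.0 (compatible; feed-monitor)" http://app.internal.example:8000/rss.xml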

  5. Select feed entries with XPath.
    >>> posts = response.xpath('//channel/item')
  6. Confirm the selector returns entries.
    >>> len(posts)
    2
  7. Extract sample fields from the first and last entries.
    >>> posts[0].xpath('title/text()').get()
    'Launch Update'
    >>> posts[-1].xpath('title/text()').get()
    'Usage Tips'
  8. Iterate over each entry to extract the required fields.
    >>> for item in response.xpath('//channel/item'):
    ...     post = {
    ...         'title': item.xpath('title/text()').get(),
    ...         'link': item.xpath('link/text()').get(),
    ...         'pubDate': item.xpath('pubDate/text()').get(),
    ...     }
    ...     print(post)
    ...
    {'title': 'Launch Update', 'link': 'http://app.internal.example:8000/news/launch-update.html', 'pubDate': '2026-01-01 09:00:00 +0000'}
    {'title': 'Usage Tips', 'link': 'http://app.internal.example:8000/news/usage-tips.html', 'pubDate': '2025-12-15 09:00:00 +0000'}

    Full article bodies are commonly stored in description or content:encoded, which can include HTML and may require cleanup before reuse.
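
    To pull such a body, register the content namespace in the XPath and strip tags with w3lib, which ships with Scrapy (a sketch, assuming the feed declares the conventional content namespace URI):

    >>> from w3lib.html import remove_tags
    >>> ns = {'content': 'http://purl.org/rss/1.0/modules/content/'}
    >>> body = posts[0].xpath('content:encoded/text()', namespaces=ns).get()
    >>> text = remove_tags(body) if body else None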

  9. Create a Scrapy spider from the validated selectors (optional).
    scrape_rss.py
    import scrapy
     
     
    class ScrapeRssSpider(scrapy.Spider):
        name = "scrape-rss"
        start_urls = ["http://app.internal.example:8000/rss.xml"]
     
        def parse(self, response):
            for post in response.xpath("//channel/item"):
                yield {
                    "title": post.xpath("title/text()").get(),
                    "link": post.xpath("link/text()").get(),
                    "pubDate": post.xpath("pubDate/text()").get(),
                }
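
    The spider yields pubDate as raw text. Most public feeds use RFC 822 dates such as "Wed, 01 Jan 2026 09:00:00 +0000", which the standard library parses; a sketch of a helper that could slot into parse (the fallback keeps the raw string when the feed uses another format, as the example feed above does):

    from email.utils import parsedate_to_datetime

    def parse_pubdate(value):
        # RFC 822 is the RSS norm; fall back to the raw string otherwise.
        try:
            return parsedate_to_datetime(value).isoformat()
        except (TypeError, ValueError):
            return value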
  10. Run the spider to export items as JSON.
    $ scrapy runspider --nolog --output -:json scrape_rss.py
    [
    {"title": "Launch Update", "link": "http://app.internal.example:8000/news/launch-update.html", "pubDate": "2026-01-01 09:00:00 +0000"},
    {"title": "Usage Tips", "link": "http://app.internal.example:8000/news/usage-tips.html", "pubDate": "2025-12-15 09:00:00 +0000"}
    ]
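
Once the export works, a small wrapper can turn the spider into a monitor by diffing each run's links against the previous one; a minimal sketch (the file names and the export command in the comment are arbitrary choices):

    import json
    from pathlib import Path

    SEEN_FILE = Path("seen_links.json")
    ITEMS_FILE = Path("items.json")  # e.g. produced by: scrapy runspider -O items.json scrape_rss.py

    seen = set(json.loads(SEEN_FILE.read_text())) if SEEN_FILE.exists() else set()
    items = json.loads(ITEMS_FILE.read_text())

    for item in items:
        if item["link"] not in seen:
            print("new post:", item["title"])  # trigger the automation here

    SEEN_FILE.write_text(json.dumps(sorted(seen | {i["link"] for i in items})))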