RSS is a feed format designed to let applications read a website's content in a structured, machine-readable form. Programs can therefore retrieve a site's updates without having to parse its HTML pages.

An RSS feed normally contains snippets of the latest website content in a standardised XML format. It is therefore one of the best starting points for a web scraper such as Scrapy to get a website's latest updates.
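
To make the XML layout concrete, here is a minimal sketch that parses a small feed with Python's standard library. The `SAMPLE_RSS` text is a hand-written, trimmed-down feed (not a real server response), and `list_items` is a hypothetical helper name; only the channel and item nesting reflects the RSS 2.0 layout.

```python
import xml.etree.ElementTree as ET

# Hand-written sample for illustration only; a real feed carries more elements.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>https://www.blog.google/</link>
    <description>Hand-written sample, not a real response</description>
    <item>
      <title>How Search Works</title>
      <link>https://www.blog.google/products/search/how-search-works/</link>
      <pubDate>Tue, 19 May 2020 13:00:00 +0000</pubDate>
      <description>Post summary goes here.</description>
    </item>
    <item>
      <title>A Doodle dedicated to the aloha spirit</title>
      <link>https://www.blog.google/inside-google/doodles/israel-kamakawiwoole-apahm/</link>
      <pubDate>Tue, 19 May 2020 22:00:00 +0000</pubDate>
      <description>Post summary goes here.</description>
    </item>
  </channel>
</rss>
"""


def list_items(rss_text):
    """Return one dict per <item> found under <channel>."""
    root = ET.fromstring(rss_text)  # root is the <rss> element
    return [
        {
            'title': item.findtext('title'),
            'link': item.findtext('link'),
            'pubDate': item.findtext('pubDate'),
        }
        for item in root.findall('channel/item')
    ]


if __name__ == '__main__':
    for post in list_items(SAMPLE_RSS):
        print(post['pubDate'], post['title'])
```

Every post lives in its own item element under channel, which is why a single path expression is enough to select all of them.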

News sites and blogs normally provide an RSS feed and usually link to it using the official RSS icon.

You can monitor when a website is updated and get the latest content by scraping the RSS feed using Scrapy.
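
One possible way to detect updates is to compare each post's publication date with the time of the previous check. The sketch below assumes the posts have already been scraped into dicts with a `pubDate` string in the date format RSS uses; `posts_since`, `SCRAPED`, and `LAST_CHECK` are hypothetical names chosen for this example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def posts_since(posts, last_check):
    """Return the posts whose pubDate is later than last_check.

    last_check must be a timezone-aware datetime so it can be compared
    with the aware datetimes parsed from the feed.
    """
    return [
        post for post in posts
        if parsedate_to_datetime(post['pubDate']) > last_check
    ]


# Three entries in the shape a scraper might produce.
SCRAPED = [
    {'title': 'Find wheelchair accessible places with Google Maps',
     'pubDate': 'Thu, 21 May 2020 16:30:00 +0000'},
    {'title': 'A reintroduction to our Knowledge Graph and  knowledge panels',
     'pubDate': 'Wed, 20 May 2020 17:00:00 +0000'},
    {'title': 'Stay "connected to culture" on International Museum Day',
     'pubDate': 'Mon, 18 May 2020 01:30:00 +0000'},
]

# Assumed time of the previous check, in UTC.
LAST_CHECK = datetime(2020, 5, 20, tzinfo=timezone.utc)

if __name__ == '__main__':
    for post in posts_since(SCRAPED, LAST_CHECK):
        print(post['pubDate'], post['title'])
```

Storing the time of the last run and passing it as `last_check` on the next run is one simple way to report only new posts.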

Steps to scrape RSS feed using Scrapy:

  1. Navigate to the site in a web browser and look for an RSS feed link or icon.
  2. Click on the RSS link to view and examine the RSS feed.

    Notice that it's basically an XML document. Blog posts are in channel→item elements.

  3. Open the Scrapy shell at the command line with the RSS feed URL as an argument.
    $ scrapy shell https://www.blog.google/rss
    2020-05-26 00:28:58 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: scrapybot)
    2020-05-26 00:28:58 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 18.9.0, Python 3.8.2 (default, Apr 27 2020, 15:53:34) - [GCC 9.3.0], pyOpenSSL 19.0.0 (OpenSSL 1.1.1f  31 Mar 2020), cryptography 2.8, Platform Linux-5.4.0-31-generic-x86_64-with-glibc2.29
    2020-05-26 00:28:58 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
    2020-05-26 00:28:58 [scrapy.crawler] INFO: Overridden settings:
    {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
     'LOGSTATS_INTERVAL': 0}
    ##### snipped
  4. Check HTTP response status and make sure it returns 200.
    In [1]: response
    Out[1]: <200 https://www.blog.google/rss>
  5. Search for blog posts with XPath based on the structure.
    In [2]: posts = response.xpath('//channel/item')
  6. Check returned item count to confirm.
    In [3]: len(posts)
    Out[3]: 20
  7. Extract an element from the first and last item to test.
    In [4]: posts[0].xpath('title/text()').extract()
    Out[4]: ['Find wheelchair accessible places with Google Maps']
    
    In [5]: posts[19].xpath('title/text()').extract()
    Out[5]: ['Stay "connected to culture" on International Museum Day']
  8. Iterate through each item to get all required data.
    In [6]: for item in response.xpath('//channel/item'):
       ...:     post = {
       ...:         'title' : item.xpath('title//text()').extract_first(),
       ...:         'link': item.xpath('link//text()').extract_first(),
       ...:         'pubDate' : item.xpath('pubDate//text()').extract_first(),
       ...:     }
       ...:     print(post)
       ...:
    {'title': 'Find wheelchair accessible places with Google Maps', 'link': 'https://www.blog.google/products/maps/wheelchair-accessible-places-google-maps/', 'pubDate': 'Thu, 21 May 2020 16:30:00 +0000'}
    {'title': 'Accessibility updates that help tech work for everyone', 'link': 'https://www.blog.google/products/android/accessibility-updates-help-tech-work-everyone/', 'pubDate': 'Thu, 21 May 2020 16:30:00 +0000'}
    {'title': 'A is for accessibility: How to make remote learning work for everyone', 'link': 'https://www.blog.google/outreach-initiatives/education/global-accessibility-awareness-day-2020/', 'pubDate': 'Thu, 21 May 2020 16:30:00 +0000'}
    {'title': 'Navigating the road ahead: The benefits of real-time marketing', 'link': 'https://www.blog.google/products/ads/real-time-marketing/', 'pubDate': 'Thu, 21 May 2020 15:00:00 +0000'}
    {'title': 'Helping COVID-19 responders find hotels', 'link': 'https://www.blog.google/products/flights-hotels/covid-19-responder-hotel-rooms/', 'pubDate': 'Thu, 21 May 2020 14:00:00 +0000'}
    {'title': 'A reintroduction to our Knowledge Graph and  knowledge panels', 'link': 'https://www.blog.google/products/search/about-knowledge-graph-and-knowledge-panels/', 'pubDate': 'Wed, 20 May 2020 17:00:00 +0000'}
    {'title': 'Exposure Notification API launches to support public health agencies', 'link': 'https://www.blog.google/inside-google/company-announcements/apple-google-exposure-notification-api-launches/', 'pubDate': 'Wed, 20 May 2020 17:00:00 +0000'}
    {'title': 'A Doodle dedicated to the aloha spirit', 'link': 'https://www.blog.google/inside-google/doodles/israel-kamakawiwoole-apahm/', 'pubDate': 'Tue, 19 May 2020 22:00:00 +0000'}
    {'title': 'More intuitive privacy and security controls in Chrome', 'link': 'https://www.blog.google/products/chrome/more-intuitive-privacy-and-security-controls-chrome/', 'pubDate': 'Tue, 19 May 2020 16:00:00 +0000'}
    {'title': 'New controls for how you share albums in Google Photos', 'link': 'https://www.blog.google/products/photos/new-controls-how-you-share-albums-google-photos/', 'pubDate': 'Tue, 19 May 2020 15:00:00 +0000'}
    {'title': 'Deliver the best ad experience every time', 'link': 'https://www.blog.google/products/admanager/deliver-best-ad-experience-every-time/', 'pubDate': 'Tue, 19 May 2020 14:47:00 +0000'}
    {'title': 'How Search Works', 'link': 'https://www.blog.google/products/search/how-search-works/', 'pubDate': 'Tue, 19 May 2020 13:00:00 +0000'}
    {'title': 'Make the best of YouTube yours with YouTube Select', 'link': 'https://www.blog.google/products/ads/introducing-youtube-select/', 'pubDate': 'Tue, 19 May 2020 12:00:00 +0000'}
    {'title': 'Take a virtual travel day with Street View', 'link': 'https://www.blog.google/products/maps/virtual-travel-day-with-street-view/', 'pubDate': 'Tue, 19 May 2020 12:00:00 +0000'}
    {'title': 'How the Nest Hub Max helps keep families connected', 'link': 'https://www.blog.google/products/google-nest/nest-hub-max-covid-19-merrill-garden/', 'pubDate': 'Mon, 18 May 2020 18:00:00 +0000'}
    {'title': 'Support for Native small businesses during COVID-19', 'link': 'https://www.blog.google/outreach-initiatives/grow-with-google/grow-with-google-national-congress-of-american-indians/', 'pubDate': 'Mon, 18 May 2020 16:00:00 +0000'}
    {'title': 'Navigating the road ahead: How consumers are adjusting to COVID-19', 'link': 'https://www.blog.google/products/ads/consumers-adjusting-covid-19/', 'pubDate': 'Mon, 18 May 2020 15:00:00 +0000'}
    {'title': 'How AI could predict sight-threatening eye conditions', 'link': 'https://www.blog.google/technology/health/predicting-sight-threatening-eye-condition/', 'pubDate': 'Mon, 18 May 2020 15:00:00 +0000'}
    {'title': 'New automated bidding solutions in Display & Video 360', 'link': 'https://www.blog.google/products/marketingplatform/360/new-automated-bidding-solutions-display-video-360/', 'pubDate': 'Mon, 18 May 2020 09:00:00 +0000'}
    {'title': 'Stay "connected to culture" on International Museum Day', 'link': 'https://www.blog.google/outreach-initiatives/arts-culture/stay-connected-culture-international-museum-day/', 'pubDate': 'Mon, 18 May 2020 01:30:00 +0000'}

    Blog content is in the description element, but it is not shown in this example because it is too lengthy.

  9. Create a Scrapy spider based on the previous shell process (optional).
    scrape-rss.py
    import scrapy


    class ScrapeRssSpider(scrapy.Spider):
        name = 'scrape-rss'
        # allowed_domains takes bare domain names, not full URLs.
        allowed_domains = ['blog.google']
        # By default Scrapy requests each URL in start_urls and passes
        # the response to parse(), so no custom start_requests() is needed.
        start_urls = ['https://www.blog.google/rss']

        def parse(self, response):
            # Each blog post is an <item> element inside <channel>.
            for post in response.xpath('//channel/item'):
                yield {
                    'title': post.xpath('title//text()').extract_first(),
                    'link': post.xpath('link//text()').extract_first(),
                    'pubDate': post.xpath('pubDate//text()').extract_first(),
                }
  10. Test the Scrapy spider to see if it works.
    $ scrapy crawl --nolog --output -:json scrape-rss
    [
    {"title": "Find wheelchair accessible places with Google Maps", "link": "https://www.blog.google/products/maps/wheelchair-accessible-places-google-maps/", "pubDate": "Thu, 21 May 2020 16:30:00 +0000"},
    {"title": "Accessibility updates that help tech work for everyone", "link": "https://www.blog.google/products/android/accessibility-updates-help-tech-work-everyone/", "pubDate": "Thu, 21 May 2020 16:30:00 +0000"},
    {"title": "A is for accessibility: How to make remote learning work for everyone", "link": "https://www.blog.google/outreach-initiatives/education/global-accessibility-awareness-day-2020/", "pubDate": "Thu, 21 May 2020 16:30:00 +0000"},
    {"title": "Navigating the road ahead: The benefits of real-time marketing", "link": "https://www.blog.google/products/ads/real-time-marketing/", "pubDate": "Thu, 21 May 2020 15:00:00 +0000"},
    {"title": "Helping COVID-19 responders find hotels", "link": "https://www.blog.google/products/flights-hotels/covid-19-responder-hotel-rooms/", "pubDate": "Thu, 21 May 2020 14:00:00 +0000"},
    {"title": "A reintroduction to our Knowledge Graph and  knowledge panels", "link": "https://www.blog.google/products/search/about-knowledge-graph-and-knowledge-panels/", "pubDate": "Wed, 20 May 2020 17:00:00 +0000"},
    {"title": "Exposure Notification API launches to support public health agencies", "link": "https://www.blog.google/inside-google/company-announcements/apple-google-exposure-notification-api-launches/", "pubDate": "Wed, 20 May 2020 17:00:00 +0000"},
    {"title": "A Doodle dedicated to the aloha spirit", "link": "https://www.blog.google/inside-google/doodles/israel-kamakawiwoole-apahm/", "pubDate": "Tue, 19 May 2020 22:00:00 +0000"},
    {"title": "More intuitive privacy and security controls in Chrome", "link": "https://www.blog.google/products/chrome/more-intuitive-privacy-and-security-controls-chrome/", "pubDate": "Tue, 19 May 2020 16:00:00 +0000"},
    {"title": "New controls for how you share albums in Google Photos", "link": "https://www.blog.google/products/photos/new-controls-how-you-share-albums-google-photos/", "pubDate": "Tue, 19 May 2020 15:00:00 +0000"},
    {"title": "Deliver the best ad experience every time", "link": "https://www.blog.google/products/admanager/deliver-best-ad-experience-every-time/", "pubDate": "Tue, 19 May 2020 14:47:00 +0000"},
    {"title": "How Search Works", "link": "https://www.blog.google/products/search/how-search-works/", "pubDate": "Tue, 19 May 2020 13:00:00 +0000"},
    {"title": "Make the best of YouTube yours with YouTube Select", "link": "https://www.blog.google/products/ads/introducing-youtube-select/", "pubDate": "Tue, 19 May 2020 12:00:00 +0000"},
    {"title": "Take a virtual travel day with Street View", "link": "https://www.blog.google/products/maps/virtual-travel-day-with-street-view/", "pubDate": "Tue, 19 May 2020 12:00:00 +0000"},
    {"title": "How the Nest Hub Max helps keep families connected", "link": "https://www.blog.google/products/google-nest/nest-hub-max-covid-19-merrill-garden/", "pubDate": "Mon, 18 May 2020 18:00:00 +0000"},
    {"title": "Support for Native small businesses during COVID-19", "link": "https://www.blog.google/outreach-initiatives/grow-with-google/grow-with-google-national-congress-of-american-indians/", "pubDate": "Mon, 18 May 2020 16:00:00 +0000"},
    {"title": "Navigating the road ahead: How consumers are adjusting to COVID-19", "link": "https://www.blog.google/products/ads/consumers-adjusting-covid-19/", "pubDate": "Mon, 18 May 2020 15:00:00 +0000"},
    {"title": "How AI could predict sight-threatening eye conditions", "link": "https://www.blog.google/technology/health/predicting-sight-threatening-eye-condition/", "pubDate": "Mon, 18 May 2020 15:00:00 +0000"},
    {"title": "New automated bidding solutions in Display & Video 360", "link": "https://www.blog.google/products/marketingplatform/360/new-automated-bidding-solutions-display-video-360/", "pubDate": "Mon, 18 May 2020 09:00:00 +0000"},
    {"title": "Stay \"connected to culture\" on International Museum Day", "link": "https://www.blog.google/outreach-initiatives/arts-culture/stay-connected-culture-international-museum-day/", "pubDate": "Mon, 18 May 2020 01:30:00 +0000"}
    ]
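
The description element mentioned in step 8 is left out of the examples because of its length. If you do want it, one option is to reduce it to a short plain-text summary before storing it. The sketch below assumes the description may contain HTML markup, which is common in feeds but not guaranteed; `summarize_description` and its `width` parameter are hypothetical names, not part of Scrapy.

```python
import textwrap
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def summarize_description(description_html, width=80):
    """Strip tags, collapse whitespace, and shorten to at most width chars."""
    parser = _TextExtractor()
    parser.feed(description_html)
    parser.close()
    text = ' '.join(''.join(parser.chunks).split())
    return textwrap.shorten(text, width=width, placeholder='...')


if __name__ == '__main__':
    print(summarize_description('<p>Hello <b>world</b></p>'))
```

Inside the spider's `parse` method this could be used as, for example, `'summary': summarize_description(post.xpath('description/text()').get(default=''))`, which keeps each output record on one readable line.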