RSS is designed to give users and applications easy access to a website's updates. A feed normally contains snippets of the latest website content in a standardised XML format. It is therefore one of the best entry points for a web scraper such as Scrapy to get the latest updates of a website.
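
To make the XML layout concrete, here is a minimal sketch that reads a tiny RSS 2.0 document with Python's standard library instead of Scrapy. The feed content and the example.com URLs are made up for illustration; only the channel and item structure mirrors what a real feed looks like.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed, trimmed to the elements used in this guide.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <link>https://example.com/</link>
    <description>Latest posts</description>
    <item>
      <title>First post</title>
      <link>https://example.com/first-post/</link>
      <pubDate>Fri, 25 May 2018 20:23:00 -0000</pubDate>
      <description>Snippet of the first post.</description>
    </item>
    <item>
      <title>Second post</title>
      <link>https://example.com/second-post/</link>
      <pubDate>Thu, 24 May 2018 17:00:00 -0000</pubDate>
      <description>Snippet of the second post.</description>
    </item>
  </channel>
</rss>
"""

# The root element is <rss>; posts live under channel/item.
root = ET.fromstring(SAMPLE_RSS)
items = [
    {
        "title": item.findtext("title"),
        "link": item.findtext("link"),
        "pubDate": item.findtext("pubDate"),
    }
    for item in root.findall("./channel/item")
]

for entry in items:
    print(entry)
```

The same channel/item path is what the XPath expressions in the steps below select.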

News sites and blogs normally provide an RSS feed and link to it using the official RSS icon.
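
Besides the visible icon, many sites also advertise their feed in the page's HTML head with a link tag of type application/rss+xml. The following is a minimal sketch, using only Python's standard library, of how such a tag could be located. The page source, the FeedLinkFinder class and the find_feed_links helper are hypothetical names chosen for this illustration, and a site may well expose its feed differently.

```python
from html.parser import HTMLParser


class FeedLinkFinder(HTMLParser):
    """Collect href values of <link> tags that advertise an RSS feed."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        rel_values = (attr.get("rel") or "").lower().split()
        feed_type = (attr.get("type") or "").lower()
        if (tag == "link"
                and "alternate" in rel_values
                and feed_type == "application/rss+xml"
                and attr.get("href")):
            self.feeds.append(attr["href"])


def find_feed_links(html):
    """Return all RSS feed URLs advertised in the given HTML source."""
    finder = FeedLinkFinder()
    finder.feed(html)
    return finder.feeds


# Hypothetical page source used only for illustration.
SAMPLE_HTML = """
<html><head>
  <title>Example blog</title>
  <link rel="stylesheet" href="/style.css">
  <link rel="alternate" type="application/rss+xml" title="RSS" href="https://example.com/rss/">
</head><body></body></html>
"""

print(find_feed_links(SAMPLE_HTML))
```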

You can monitor when a website is updated and get the latest content by scraping the RSS feed using Scrapy.
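
One simple way to detect updates is to remember which item links have already been seen and report only the new ones on each poll. The sketch below assumes that approach and uses the standard library with made-up feeds; new_items and build_feed are hypothetical helpers, and in practice the feed text would come from a Scrapy response rather than a string built in code.

```python
import xml.etree.ElementTree as ET


def new_items(feed_xml, seen_links):
    """Return items whose link is not yet in seen_links, and record them."""
    fresh = []
    for item in ET.fromstring(feed_xml).findall("./channel/item"):
        link = item.findtext("link")
        if link and link not in seen_links:
            seen_links.add(link)
            fresh.append({"title": item.findtext("title"), "link": link})
    return fresh


def build_feed(posts):
    """Build a tiny hypothetical RSS document from (title, link) pairs."""
    items = "".join(
        f"<item><title>{title}</title><link>{link}</link></item>"
        for title, link in posts
    )
    return f'<rss version="2.0"><channel>{items}</channel></rss>'


seen = set()

# First poll: the feed contains a single post.
first_poll = new_items(build_feed([("Post A", "https://example.com/a/")]), seen)

# Second poll: a newer post has been added at the top of the feed.
second_poll = new_items(
    build_feed([
        ("Post B", "https://example.com/b/"),
        ("Post A", "https://example.com/a/"),
    ]),
    seen,
)

print(first_poll)
print(second_poll)
```

Keeping the set of seen links in a file or database between runs would let the check survive restarts.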

Steps to scrape an RSS feed using Scrapy:

  1. Navigate to the site in a web browser and look for an RSS feed link or icon.
  2. Click the RSS link to view and examine the RSS feed.

    Notice that it's basically an XML document in which the blog posts are channel→item elements.

  3. Open the Scrapy shell at the command line with the RSS feed URL as an argument.
    $ scrapy shell https://www.blog.google/rss/
  4. Check the HTTP response status and make sure it is 200.
    >>> response
    <200 https://www.blog.google/rss/>
  5. Search for blog posts with XPath based on the structure.
    >>> posts = response.xpath('//channel/item')
  6. Check the number of items returned to confirm the match.
    >>> len(posts)
    20
  7. Extract an element from the first and last items to test.
    >>> posts[0].xpath('title/text()').extract()
    [u'The High Five: Sip sip, hooray!']
    >>> posts[19].xpath('title/text()').extract()
    [u'Now on iOS: new vehicle icons to spice up your drive']
  8. Iterate through each item to get all required data.
    >>> for item in response.xpath('//channel/item'):
    ...   post = {
    ...     'title': item.xpath('title//text()').extract_first(),
    ...     'link': item.xpath('link//text()').extract_first(),
    ...     'pubDate': item.xpath('pubDate//text()').extract_first(),
    ...   }
    ...   print(post)
    ...
    {'link': u'https://www.blog.google/topics/trends/high-five-sip-sip-hooray/', 'pubDate': u'Fri, 25 May 2018 20:23:00 -0000', 'title': u'The High Five: Sip sip, hooray!'}
    {'link': u'https://www.blog.google/topics/inside-google/we-are-many-and-one-googlers-mark-aapi-heritage-month/', 'pubDate': u'Fri, 25 May 2018 18:44:00 -0000', 'title': u'We are many and one: Googlers mark AAPI Heritage Month'}
    {'link': u'https://www.blog.google/products/pixel/teampixel-rolls-out-red-carpet-week/', 'pubDate': u'Fri, 25 May 2018 18:07:00 -0000', 'title': u'#teampixel rolls out the red carpet this week'}
    {'link': u'https://www.blog.google/products/google-play/new-look-google-play-movies-tv-your-roku-device/', 'pubDate': u'Thu, 24 May 2018 17:00:00 -0000', 'title': u'A new look for Google Play Movies & TV on your Roku device'}
    {'link': u'https://www.blog.google/topics/education/google-science-fair-2018-resources-educators-get-ideas-flowing/', 'pubDate': u'Thu, 24 May 2018 16:40:00 -0000', 'title': u'Google Science Fair 2018: Resources for educators to get ideas flowing'}
    {'link': u'https://www.blog.google/topics/education/more-tools-homeschoolers/', 'pubDate': u'Thu, 24 May 2018 14:30:00 -0000', 'title': u'More tools for homeschoolers'}
    {'link': u'https://www.blog.google/topics/shopping-payments/add-suica-and-waon-google-pay-japan/', 'pubDate': u'Thu, 24 May 2018 14:00:00 -0000', 'title': u'Now you can add Suica and WAON to Google Pay in Japan'}
    {'link': u'https://www.blog.google/topics/arts-culture/faces-frida-digital-retrospective-google-arts-culture/', 'pubDate': u'Thu, 24 May 2018 01:00:00 -0000', 'title': u'Faces of Frida: a digital retrospective on Google Arts & Culture'}
    {'link': u'https://www.blog.google/topics/inside-google/court-how-nba-spent-day-google/', 'pubDate': u'Wed, 23 May 2018 17:00:00 -0000', 'title': u'Off the court: how the NBA spent a day at Google'}
    {'link': u'https://www.blog.google/topics/education/our-2018-professional-development-grants-support-cs-educators/', 'pubDate': u'Wed, 23 May 2018 16:40:00 -0000', 'title': u'Supporting CS educators in Europe, the Middle East and Africa'}
    {'link': u'https://www.blog.google/topics/trends/see-what-world-searching-updated-google-trends/', 'pubDate': u'Wed, 23 May 2018 16:30:00 -0000', 'title': u'See what the world is searching for with the updated Google Trends'}
    {'link': u'https://www.blog.google/products/g-suite/g-suite-pro-tips-how-sync-one-spreadsheet-another-google-sheets/', 'pubDate': u'Wed, 23 May 2018 16:00:00 -0000', 'title': u'G Suite Pro Tips: how to sync one spreadsheet to another in Google Sheets'}
    {'link': u'https://www.blog.google/products/google-play/first-person-personal-stories-creative-people-behind-mobile-gaming/', 'pubDate': u'Wed, 23 May 2018 16:00:00 -0000', 'title': u'First Person: The personal stories of the creative people behind mobile gaming'}
    {'link': u'https://www.blog.google/topics/small-business/taking-action-against-scammers/', 'pubDate': u'Wed, 23 May 2018 16:00:00 -0000', 'title': u'Taking action against scammers'}
    {'link': u'https://www.blog.google/topics/google-europe/100-million-skills-and-opportunity-europe-middle-east-and-africa/', 'pubDate': u'Wed, 23 May 2018 14:00:00 -0000', 'title': u'$100 million for skills and opportunity in Europe, Middle East, and Africa'}
    {'link': u'https://www.blog.google/topics/machine-learning/new-york-times-using-ai-host-better-conversations/', 'pubDate': u'Wed, 23 May 2018 13:00:00 -0000', 'title': u'New York Times: Using AI to host better conversations'}
    {'link': u'https://www.blog.google/topics/google-asia/apply-for-demo-day-asia/', 'pubDate': u'Wed, 23 May 2018 02:00:00 -0000', 'title': u'Seize the day and take the stage at Demo Day Asia'}
    {'link': u'https://www.blog.google/topics/connected-workspaces/how-climatecom-uses-chrome-browser-more-connected-and-productive-workforce/', 'pubDate': u'Tue, 22 May 2018 16:00:00 -0000', 'title': u'How Climate.com uses Chrome Browser for a more connected and productive workforce'}
    {'link': u'https://www.blog.google/products/android-enterprise/android-p-more-power-enterprises/', 'pubDate': u'Tue, 22 May 2018 13:10:00 -0000', 'title': u'Android P: More power for enterprises'}
    {'link': u'https://www.blog.google/products/maps/now-ios-new-vehicle-icons-spice-your-drive/', 'pubDate': u'Mon, 21 May 2018 17:00:00 -0000', 'title': u'Now on iOS: new vehicle icons to spice up your drive'}

    The blog content is in the description element; it is omitted from this example because it is too long to show here.

  9. Create a Scrapy spider based on the previous shell process (optional).
    google_rss.py
    import scrapy

    class GoogleRssSpider(scrapy.Spider):
        # Name used to run the spider: scrapy crawl google_rss
        name = "google_rss"

        def start_requests(self):
            # Feed URLs to fetch; add more feeds to this list as needed.
            urls = [
                'https://www.blog.google/rss/',
            ]
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            # Each blog post is an <item> element inside <channel>.
            for post in response.xpath('//channel/item'):
                yield {
                    'title': post.xpath('title//text()').extract_first(),
                    'link': post.xpath('link//text()').extract_first(),
                    'pubDate': post.xpath('pubDate//text()').extract_first(),
                }
  10. Test Scrapy spider to see if it works.
    $ scrapy crawl --nolog -o - -t json google_rss
    [
    {"link": "https://www.blog.google/topics/trends/high-five-sip-sip-hooray/", "pubDate": "Fri, 25 May 2018 20:23:00 -0000", "title": "The High Five: Sip sip, hooray!"},
    {"link": "https://www.blog.google/topics/inside-google/we-are-many-and-one-googlers-mark-aapi-heritage-month/", "pubDate": "Fri, 25 May 2018 18:44:00 -0000", "title": "We are many and one: Googlers mark AAPI Heritage Month"},
    {"link": "https://www.blog.google/products/pixel/teampixel-rolls-out-red-carpet-week/", "pubDate": "Fri, 25 May 2018 18:07:00 -0000", "title": "#teampixel rolls out the red carpet this week"},
    {"link": "https://www.blog.google/products/google-play/new-look-google-play-movies-tv-your-roku-device/", "pubDate": "Thu, 24 May 2018 17:00:00 -0000", "title": "A new look for Google Play Movies & TV on your Roku device"},
    {"link": "https://www.blog.google/topics/education/google-science-fair-2018-resources-educators-get-ideas-flowing/", "pubDate": "Thu, 24 May 2018 16:40:00 -0000", "title": "Google Science Fair 2018: Resources for educators to get ideas flowing"},
    {"link": "https://www.blog.google/topics/education/more-tools-homeschoolers/", "pubDate": "Thu, 24 May 2018 14:30:00 -0000", "title": "More tools for homeschoolers"},
    {"link": "https://www.blog.google/topics/shopping-payments/add-suica-and-waon-google-pay-japan/", "pubDate": "Thu, 24 May 2018 14:00:00 -0000", "title": "Now you can add Suica and WAON to Google Pay in Japan"},
    {"link": "https://www.blog.google/topics/arts-culture/faces-frida-digital-retrospective-google-arts-culture/", "pubDate": "Thu, 24 May 2018 01:00:00 -0000", "title": "Faces of Frida: a digital retrospective on Google Arts & Culture"},
    {"link": "https://www.blog.google/topics/inside-google/court-how-nba-spent-day-google/", "pubDate": "Wed, 23 May 2018 17:00:00 -0000", "title": "Off the court: how the NBA spent a day at Google"},
    {"link": "https://www.blog.google/topics/education/our-2018-professional-development-grants-support-cs-educators/", "pubDate": "Wed, 23 May 2018 16:40:00 -0000", "title": "Supporting CS educators in Europe, the Middle East and Africa"},
    {"link": "https://www.blog.google/topics/trends/see-what-world-searching-updated-google-trends/", "pubDate": "Wed, 23 May 2018 16:30:00 -0000", "title": "See what the world is searching for with the updated Google Trends"},
    {"link": "https://www.blog.google/products/g-suite/g-suite-pro-tips-how-sync-one-spreadsheet-another-google-sheets/", "pubDate": "Wed, 23 May 2018 16:00:00 -0000", "title": "G Suite Pro Tips: how to sync one spreadsheet to another in Google Sheets"},
    {"link": "https://www.blog.google/products/google-play/first-person-personal-stories-creative-people-behind-mobile-gaming/", "pubDate": "Wed, 23 May 2018 16:00:00 -0000", "title": "First Person: The personal stories of the creative people behind mobile gaming"},
    {"link": "https://www.blog.google/topics/small-business/taking-action-against-scammers/", "pubDate": "Wed, 23 May 2018 16:00:00 -0000", "title": "Taking action against scammers"},
    {"link": "https://www.blog.google/topics/google-europe/100-million-skills-and-opportunity-europe-middle-east-and-africa/", "pubDate": "Wed, 23 May 2018 14:00:00 -0000", "title": "$100 million for skills and opportunity in Europe, Middle East, and Africa"},
    {"link": "https://www.blog.google/topics/machine-learning/new-york-times-using-ai-host-better-conversations/", "pubDate": "Wed, 23 May 2018 13:00:00 -0000", "title": "New York Times: Using AI to host better conversations"},
    {"link": "https://www.blog.google/topics/google-asia/apply-for-demo-day-asia/", "pubDate": "Wed, 23 May 2018 02:00:00 -0000", "title": "Seize the day and take the stage at Demo Day Asia"},
    {"link": "https://www.blog.google/topics/connected-workspaces/how-climatecom-uses-chrome-browser-more-connected-and-productive-workforce/", "pubDate": "Tue, 22 May 2018 16:00:00 -0000", "title": "How Climate.com uses Chrome Browser for a more connected and productive workforce"},
    {"link": "https://www.blog.google/products/android-enterprise/android-p-more-power-enterprises/", "pubDate": "Tue, 22 May 2018 13:10:00 -0000", "title": "Android P: More power for enterprises"},
    {"link": "https://www.blog.google/products/maps/now-ios-new-vehicle-icons-spice-your-drive/", "pubDate": "Mon, 21 May 2018 17:00:00 -0000", "title": "Now on iOS: new vehicle icons to spice up your drive"}
    ]
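
The pubDate values scraped above are plain strings, so sorting or filtering by date needs a conversion first. The following is a minimal sketch of one way to do that with Python's standard library; parse_pub_date is a hypothetical helper, and treating a "-0000" offset as UTC is an assumption made here for convenience.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def parse_pub_date(value):
    """Convert an RSS pubDate string to a timezone-aware UTC datetime."""
    parsed = parsedate_to_datetime(value)
    if parsed.tzinfo is None:
        # A "-0000" offset yields a naive datetime; treat it as UTC here.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


# Two of the scraped items from the output above, deliberately out of order.
posts = [
    {"title": "Android P: More power for enterprises",
     "pubDate": "Tue, 22 May 2018 13:10:00 -0000"},
    {"title": "The High Five: Sip sip, hooray!",
     "pubDate": "Fri, 25 May 2018 20:23:00 -0000"},
]

# Sort newest first using the parsed dates instead of the raw strings.
newest_first = sorted(
    posts, key=lambda post: parse_pub_date(post["pubDate"]), reverse=True
)
print([post["title"] for post in newest_first])
```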