When scraping blogs or news sites, it is usually better to scrape the RSS feed than the HTML pages themselves, whenever a feed is available. RSS feeds are more structured and easier to parse.
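
To make that structure concrete, here is a minimal sketch that uses only Python's standard library on a hand-written sample feed. The feed contents, the example.com URLs, and the parse_items helper are assumptions made up for illustration, not taken from a real site. What it relies on is the RSS 2.0 layout, where each post is an item element nested inside channel.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written RSS 2.0 sample (not a real feed).
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <link>https://example.com/</link>
    <item>
      <title>First post</title>
      <link>https://example.com/first-post/</link>
      <pubDate>Fri, 25 May 2018 20:23:00 -0000</pubDate>
    </item>
    <item>
      <title>Second post</title>
      <link>https://example.com/second-post/</link>
      <pubDate>Thu, 24 May 2018 17:00:00 -0000</pubDate>
    </item>
  </channel>
</rss>
"""


def parse_items(feed_xml):
    """Return one dict per item element found under channel."""
    root = ET.fromstring(feed_xml)  # root is the <rss> element
    return [
        {
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "pubDate": item.findtext("pubDate"),
        }
        for item in root.findall("channel/item")
    ]


posts = parse_items(SAMPLE_FEED)
for post in posts:
    print(post)
```

Every post exposes the same named fields, so no site-specific CSS classes or page layout need to be reverse-engineered, which is the main advantage over scraping the HTML.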

In this walkthrough, we'll scrape the RSS feed of the Google blog at https://www.blog.google/

  1. Open the site in a web browser and look for an RSS feed link or icon.
  2. Open the feed and notice that it is an XML document. Examine its structure: each blog post is an item element inside channel (channel→item).
  3. Open the Scrapy shell from the command line.
    $ scrapy shell https://www.blog.google/rss/
  4. Check the HTTP response status and make sure it is 200.
    >>> response
    <200 https://www.blog.google/rss/>
  5. Select the blog posts with an XPath expression based on that structure.
    >>> posts = response.xpath('//channel/item')
  6. Check the number of items returned. The feed should contain 20 items (posts), matching a manual count.
    >>> len(posts)
    20
  7. Extract the title of the first and last items to confirm.
    >>> posts[0].xpath('title/text()').extract()
    [u'The High Five: Sip sip, hooray!']
    >>> posts[19].xpath('title/text()').extract()
    [u'Now on iOS: new vehicle icons to spice up your drive']
  8. Iterate over the items to get the data for every blog post. The post content is in the description element, but it is omitted here because it is too long to show.
    >>> for item in response.xpath('//channel/item'):
    ...   post = {
    ...     'title': item.xpath('title//text()').extract_first(),
    ...     'link': item.xpath('link//text()').extract_first(),
    ...     'pubDate': item.xpath('pubDate//text()').extract_first(),
    ...   }
    ...   print(post)
    ...
    {'link': u'https://www.blog.google/topics/trends/high-five-sip-sip-hooray/', 'pubDate': u'Fri, 25 May 2018 20:23:00 -0000', 'title': u'The High Five: Sip sip, hooray!'}
    {'link': u'https://www.blog.google/topics/inside-google/we-are-many-and-one-googlers-mark-aapi-heritage-month/', 'pubDate': u'Fri, 25 May 2018 18:44:00 -0000', 'title': u'We are many and one: Googlers mark AAPI Heritage Month'}
    {'link': u'https://www.blog.google/products/pixel/teampixel-rolls-out-red-carpet-week/', 'pubDate': u'Fri, 25 May 2018 18:07:00 -0000', 'title': u'#teampixel rolls out the red carpet this week'}
    {'link': u'https://www.blog.google/products/google-play/new-look-google-play-movies-tv-your-roku-device/', 'pubDate': u'Thu, 24 May 2018 17:00:00 -0000', 'title': u'A new look for Google Play Movies & TV on your Roku device'}
    {'link': u'https://www.blog.google/topics/education/google-science-fair-2018-resources-educators-get-ideas-flowing/', 'pubDate': u'Thu, 24 May 2018 16:40:00 -0000', 'title': u'Google Science Fair 2018: Resources for educators to get ideas flowing'}
    {'link': u'https://www.blog.google/topics/education/more-tools-homeschoolers/', 'pubDate': u'Thu, 24 May 2018 14:30:00 -0000', 'title': u'More tools for homeschoolers'}
    {'link': u'https://www.blog.google/topics/shopping-payments/add-suica-and-waon-google-pay-japan/', 'pubDate': u'Thu, 24 May 2018 14:00:00 -0000', 'title': u'Now you can add Suica and WAON to Google Pay in Japan'}
    {'link': u'https://www.blog.google/topics/arts-culture/faces-frida-digital-retrospective-google-arts-culture/', 'pubDate': u'Thu, 24 May 2018 01:00:00 -0000', 'title': u'Faces of Frida: a digital retrospective on Google Arts & Culture'}
    {'link': u'https://www.blog.google/topics/inside-google/court-how-nba-spent-day-google/', 'pubDate': u'Wed, 23 May 2018 17:00:00 -0000', 'title': u'Off the court: how the NBA spent a day at Google'}
    {'link': u'https://www.blog.google/topics/education/our-2018-professional-development-grants-support-cs-educators/', 'pubDate': u'Wed, 23 May 2018 16:40:00 -0000', 'title': u'Supporting CS educators in Europe, the Middle East and Africa'}
    {'link': u'https://www.blog.google/topics/trends/see-what-world-searching-updated-google-trends/', 'pubDate': u'Wed, 23 May 2018 16:30:00 -0000', 'title': u'See what the world is searching for with the updated Google Trends'}
    {'link': u'https://www.blog.google/products/g-suite/g-suite-pro-tips-how-sync-one-spreadsheet-another-google-sheets/', 'pubDate': u'Wed, 23 May 2018 16:00:00 -0000', 'title': u'G Suite Pro Tips: how to sync one spreadsheet to another in Google Sheets'}
    {'link': u'https://www.blog.google/products/google-play/first-person-personal-stories-creative-people-behind-mobile-gaming/', 'pubDate': u'Wed, 23 May 2018 16:00:00 -0000', 'title': u'First Person: The personal stories of the creative people behind mobile gaming'}
    {'link': u'https://www.blog.google/topics/small-business/taking-action-against-scammers/', 'pubDate': u'Wed, 23 May 2018 16:00:00 -0000', 'title': u'Taking action against scammers'}
    {'link': u'https://www.blog.google/topics/google-europe/100-million-skills-and-opportunity-europe-middle-east-and-africa/', 'pubDate': u'Wed, 23 May 2018 14:00:00 -0000', 'title': u'$100 million for skills and opportunity in Europe, Middle East, and Africa'}
    {'link': u'https://www.blog.google/topics/machine-learning/new-york-times-using-ai-host-better-conversations/', 'pubDate': u'Wed, 23 May 2018 13:00:00 -0000', 'title': u'New York Times: Using AI to host better conversations'}
    {'link': u'https://www.blog.google/topics/google-asia/apply-for-demo-day-asia/', 'pubDate': u'Wed, 23 May 2018 02:00:00 -0000', 'title': u'Seize the day and take the stage at Demo Day Asia'}
    {'link': u'https://www.blog.google/topics/connected-workspaces/how-climatecom-uses-chrome-browser-more-connected-and-productive-workforce/', 'pubDate': u'Tue, 22 May 2018 16:00:00 -0000', 'title': u'How Climate.com uses Chrome Browser for a more connected and productive workforce'}
    {'link': u'https://www.blog.google/products/android-enterprise/android-p-more-power-enterprises/', 'pubDate': u'Tue, 22 May 2018 13:10:00 -0000', 'title': u'Android P: More power for enterprises'}
    {'link': u'https://www.blog.google/products/maps/now-ios-new-vehicle-icons-spice-your-drive/', 'pubDate': u'Mon, 21 May 2018 17:00:00 -0000', 'title': u'Now on iOS: new vehicle icons to spice up your drive'}
  9. You can now create a spider as part of your Scrapy project, saved here as google_rss.py.
    # google_rss.py
    import scrapy


    class GoogleRssSpider(scrapy.Spider):
        """Scrape title, link, and publication date from the Google blog RSS feed."""
        name = "google_rss"

        def start_requests(self):
            # Feed URLs to request.
            urls = [
                'https://www.blog.google/rss/',
            ]
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            # Each blog post is an item element inside channel.
            for post in response.xpath('//channel/item'):
                yield {
                    'title': post.xpath('title//text()').extract_first(),
                    'link': post.xpath('link//text()').extract_first(),
                    'pubDate': post.xpath('pubDate//text()').extract_first(),
                }
  10. Let's test the spider to see if it works.
    $ scrapy crawl --nolog -o - -t json google_rss
    [
    {"link": "https://www.blog.google/topics/trends/high-five-sip-sip-hooray/", "pubDate": "Fri, 25 May 2018 20:23:00 -0000", "title": "The High Five: Sip sip, hooray!"},
    {"link": "https://www.blog.google/topics/inside-google/we-are-many-and-one-googlers-mark-aapi-heritage-month/", "pubDate": "Fri, 25 May 2018 18:44:00 -0000", "title": "We are many and one: Googlers mark AAPI Heritage Month"},
    {"link": "https://www.blog.google/products/pixel/teampixel-rolls-out-red-carpet-week/", "pubDate": "Fri, 25 May 2018 18:07:00 -0000", "title": "#teampixel rolls out the red carpet this week"},
    {"link": "https://www.blog.google/products/google-play/new-look-google-play-movies-tv-your-roku-device/", "pubDate": "Thu, 24 May 2018 17:00:00 -0000", "title": "A new look for Google Play Movies & TV on your Roku device"},
    {"link": "https://www.blog.google/topics/education/google-science-fair-2018-resources-educators-get-ideas-flowing/", "pubDate": "Thu, 24 May 2018 16:40:00 -0000", "title": "Google Science Fair 2018: Resources for educators to get ideas flowing"},
    {"link": "https://www.blog.google/topics/education/more-tools-homeschoolers/", "pubDate": "Thu, 24 May 2018 14:30:00 -0000", "title": "More tools for homeschoolers"},
    {"link": "https://www.blog.google/topics/shopping-payments/add-suica-and-waon-google-pay-japan/", "pubDate": "Thu, 24 May 2018 14:00:00 -0000", "title": "Now you can add Suica and WAON to Google Pay in Japan"},
    {"link": "https://www.blog.google/topics/arts-culture/faces-frida-digital-retrospective-google-arts-culture/", "pubDate": "Thu, 24 May 2018 01:00:00 -0000", "title": "Faces of Frida: a digital retrospective on Google Arts & Culture"},
    {"link": "https://www.blog.google/topics/inside-google/court-how-nba-spent-day-google/", "pubDate": "Wed, 23 May 2018 17:00:00 -0000", "title": "Off the court: how the NBA spent a day at Google"},
    {"link": "https://www.blog.google/topics/education/our-2018-professional-development-grants-support-cs-educators/", "pubDate": "Wed, 23 May 2018 16:40:00 -0000", "title": "Supporting CS educators in Europe, the Middle East and Africa"},
    {"link": "https://www.blog.google/topics/trends/see-what-world-searching-updated-google-trends/", "pubDate": "Wed, 23 May 2018 16:30:00 -0000", "title": "See what the world is searching for with the updated Google Trends"},
    {"link": "https://www.blog.google/products/g-suite/g-suite-pro-tips-how-sync-one-spreadsheet-another-google-sheets/", "pubDate": "Wed, 23 May 2018 16:00:00 -0000", "title": "G Suite Pro Tips: how to sync one spreadsheet to another in Google Sheets"},
    {"link": "https://www.blog.google/products/google-play/first-person-personal-stories-creative-people-behind-mobile-gaming/", "pubDate": "Wed, 23 May 2018 16:00:00 -0000", "title": "First Person: The personal stories of the creative people behind mobile gaming"},
    {"link": "https://www.blog.google/topics/small-business/taking-action-against-scammers/", "pubDate": "Wed, 23 May 2018 16:00:00 -0000", "title": "Taking action against scammers"},
    {"link": "https://www.blog.google/topics/google-europe/100-million-skills-and-opportunity-europe-middle-east-and-africa/", "pubDate": "Wed, 23 May 2018 14:00:00 -0000", "title": "$100 million for skills and opportunity in Europe, Middle East, and Africa"},
    {"link": "https://www.blog.google/topics/machine-learning/new-york-times-using-ai-host-better-conversations/", "pubDate": "Wed, 23 May 2018 13:00:00 -0000", "title": "New York Times: Using AI to host better conversations"},
    {"link": "https://www.blog.google/topics/google-asia/apply-for-demo-day-asia/", "pubDate": "Wed, 23 May 2018 02:00:00 -0000", "title": "Seize the day and take the stage at Demo Day Asia"},
    {"link": "https://www.blog.google/topics/connected-workspaces/how-climatecom-uses-chrome-browser-more-connected-and-productive-workforce/", "pubDate": "Tue, 22 May 2018 16:00:00 -0000", "title": "How Climate.com uses Chrome Browser for a more connected and productive workforce"},
    {"link": "https://www.blog.google/products/android-enterprise/android-p-more-power-enterprises/", "pubDate": "Tue, 22 May 2018 13:10:00 -0000", "title": "Android P: More power for enterprises"},
    {"link": "https://www.blog.google/products/maps/now-ios-new-vehicle-icons-spice-your-drive/", "pubDate": "Mon, 21 May 2018 17:00:00 -0000", "title": "Now on iOS: new vehicle icons to spice up your drive"}
    ]
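
The scraped pubDate values are still plain strings. As a possible follow-up, and not part of the spider above, the sketch below turns them into datetime objects with the standard library's email.utils.parsedate_to_datetime, which reads this style of date. It assumes Python 3, the helper names with_parsed_date and newest_first are hypothetical, and the two sample records are copied from the spider output above.

```python
from email.utils import parsedate_to_datetime

# Two records copied from the spider output above, deliberately out of order.
posts = [
    {
        "link": "https://www.blog.google/products/maps/now-ios-new-vehicle-icons-spice-your-drive/",
        "pubDate": "Mon, 21 May 2018 17:00:00 -0000",
        "title": "Now on iOS: new vehicle icons to spice up your drive",
    },
    {
        "link": "https://www.blog.google/topics/trends/high-five-sip-sip-hooray/",
        "pubDate": "Fri, 25 May 2018 20:23:00 -0000",
        "title": "The High Five: Sip sip, hooray!",
    },
]


def with_parsed_date(post):
    """Return a copy of post with an extra 'published' datetime field."""
    parsed = dict(post)
    # A "-0000" offset carries no timezone information, so the result
    # is a naive datetime (tzinfo is None).
    parsed["published"] = parsedate_to_datetime(post["pubDate"])
    return parsed


def newest_first(items):
    """Sort posts by parsed publication date, most recent first."""
    return sorted(
        (with_parsed_date(item) for item in items),
        key=lambda item: item["published"],
        reverse=True,
    )


ordered = newest_first(posts)
for post in ordered:
    print(post["published"].isoformat(), post["title"])
```

Parsing the date in a separate step keeps the spider itself simple. The same conversion could instead live in a Scrapy item pipeline if the project needs typed dates in every export.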