Scrapy is an excellent tool for scraping websites. One common use case is scraping data from an HTML table, which means iterating over its rows and columns to pull out the values you need.

For this example we're going to scrape Bootstrap's documentation page for tables. Specifically, we'll work on the Striped rows example table.

  1. Inspect the table element with your browser's built-in developer tools or by viewing the page source. In this case, the table is assigned the classes table and table-striped.
  2. Here's the actual HTML code for the table:
    <table class="table table-striped">
      <thead>
        <tr>
          <th scope="col">#</th>
          <th scope="col">First</th>
          <th scope="col">Last</th>
          <th scope="col">Handle</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th scope="row">1</th>
          <td>Mark</td>
          <td>Otto</td>
          <td>@mdo</td>
        </tr>
        <tr>
          <th scope="row">2</th>
          <td>Jacob</td>
          <td>Thornton</td>
          <td>@fat</td>
        </tr>
        <tr>
          <th scope="row">3</th>
          <td>Larry</td>
          <td>the Bird</td>
          <td>@twitter</td>
        </tr>
      </tbody>
    </table>
  3. Launch the Scrapy shell from the terminal.
    $ scrapy shell https://getbootstrap.com/docs/4.0/content/tables/
  4. Check the HTTP response code to see if the request was successful. 200 is the HTTP status code for a successful (OK) response.
    >>> response
    <200 https://getbootstrap.com/docs/4.0/content/tables/>
  5. Locate the exact table with an XPath selector. Here we match on the full value of its class attribute, table table-striped.
    >>> table = response.xpath('//*[@class="table table-striped"]')
    >>> table
    [<Selector xpath='//*[@class="table table-striped"]' data=u'<table class="table table-striped">\n  <t'>]
  6. The data we're interested in is actually in the <tbody> section, so we'll narrow the selection down a bit.
    >>> table = response.xpath('//*[@class="table table-striped"]//tbody')
    >>> table
    [<Selector xpath='//*[@class="table table-striped"]//tbody' data=u'<tbody>\n    <tr>\n      <th scope="row">1'>]
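  7. From there we get the table's rows via <tr>. Note that an XPath expression starting with // searches the whole document rather than just the <tbody> we selected, which is why the header row shows up in the output below; .//tr would keep the search inside table.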
    >>> rows = table.xpath('//tr')
    >>> rows
    [<Selector xpath='//tr' data=u'<tr>\n      <th scope="col">#</th>\n      '>, <Selector xpath='//tr' data=u'<tr>\n      <th scope="row">1</th>\n      '>, <Selector xpath='//tr' data=u'<tr>\n      <th scope="row">2</th>\n      '>, <Selector xpath='//tr' data=u'<tr>\n      <th scope="row">3</th>\n      '>,
    #---snipped---
  8. The matching rows are returned as a list. Since index 0 is the header row, rows[2] is the second row of the table body, so let's work on that one.
    >>> row = rows[2]
  9. Access the row's columns via the <td> selector, and extract the text of the first one.
    >>> row.xpath('td//text()')[0].extract()
    u'Jacob'

    The first cell of each body row uses <th> instead of <td>, so the td selector skips it. Index 0 therefore corresponds to the First column of the table, not the # column.
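
A minimal sketch of why the td index starts at the First column. It uses Python's standard-library xml.etree.ElementTree in place of Scrapy's selectors (an assumption, made so the snippet runs without Scrapy or network access), applied to the second body row copied from the HTML above. ElementTree only supports a subset of XPath, but the relative td and td[1] paths behave like the ones used in the shell.

```python
import xml.etree.ElementTree as ET

# Second body row, copied from the table HTML in step 2.
ROW_HTML = """
<tr>
  <th scope="row">2</th>
  <td>Jacob</td>
  <td>Thornton</td>
  <td>@fat</td>
</tr>
"""

row = ET.fromstring(ROW_HTML.strip())

# Only <td> children are matched; the leading <th> cell is skipped.
cells = [td.text for td in row.findall("td")]
print(cells)                     # ['Jacob', 'Thornton', '@fat']

# XPath positions are 1-based, so td[1] is the same cell as cells[0].
print(row.find("td[1]").text)    # Jacob

# The row number lives in the <th> cell and needs its own selector.
print(row.find("th").text)       # 2
```

This is also why the loop in the next step uses td[1], td[2] and td[3] for the three name columns.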

  10. It's now time to combine everything into complete code by iterating over the rows with a for loop.
    >>> for row in response.xpath('//*[@class="table table-striped"]//tbody//tr'):
    ...     name = {
    ...         'first': row.xpath('td[1]//text()').extract_first(),
    ...         'last': row.xpath('td[2]//text()').extract_first(),
    ...         'handle': row.xpath('td[3]//text()').extract_first(),
    ...     }
    ...     print(name)
    ...
    {'handle': u'@mdo', 'last': u'Otto', 'first': u'Mark'}
    {'handle': u'@fat', 'last': u'Thornton', 'first': u'Jacob'}
    {'handle': u'@twitter', 'last': u'the Bird', 'first': u'Larry'}
  11. Create a spider named bootstrap_table if you want to run it as part of a Scrapy project.
    import scrapy
     
    class BootstrapTableSpider(scrapy.Spider):
        name = "bootstrap_table"
     
        def start_requests(self):
            urls = [
                'https://getbootstrap.com/docs/4.0/content/tables/',
            ]
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)
     
        def parse(self, response):
            for row in response.xpath('//*[@class="table table-striped"]//tbody/tr'):
                yield {
                    'first': row.xpath('td[1]//text()').extract_first(),
                    'last': row.xpath('td[2]//text()').extract_first(),
                    'handle': row.xpath('td[3]//text()').extract_first(),
                }
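  12. Run the spider and print the results as JSON to standard output. Newer Scrapy releases deprecate the -t option; there the format goes into the output URI instead, as in -o -:json.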
    $ scrapy crawl --nolog -o - -t json bootstrap_table
    [
    {"last": "Otto", "handle": "@mdo", "first": "Mark"},
    {"last": "Thornton", "handle": "@fat", "first": "Jacob"},
    {"last": "the Bird", "handle": "@twitter", "first": "Larry"}
    ]
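
The loop in steps 10 and 11 hard-codes the column names. As a rough sketch of an alternative, the names can be read from the <thead> row and zipped with each body row's cells. The snippet below again uses the standard-library xml.etree.ElementTree rather than Scrapy (an assumption, so it runs offline on the HTML from step 2), and table_to_dicts is a hypothetical helper name, not part of Scrapy.

```python
import xml.etree.ElementTree as ET

# Table HTML copied from step 2.
TABLE_HTML = """
<table class="table table-striped">
  <thead>
    <tr>
      <th scope="col">#</th>
      <th scope="col">First</th>
      <th scope="col">Last</th>
      <th scope="col">Handle</th>
    </tr>
  </thead>
  <tbody>
    <tr><th scope="row">1</th><td>Mark</td><td>Otto</td><td>@mdo</td></tr>
    <tr><th scope="row">2</th><td>Jacob</td><td>Thornton</td><td>@fat</td></tr>
    <tr><th scope="row">3</th><td>Larry</td><td>the Bird</td><td>@twitter</td></tr>
  </tbody>
</table>
"""


def table_to_dicts(table):
    """Hypothetical helper: map each body row to {header: cell text}."""
    # Column names come from the header row, lower-cased for use as keys.
    headers = [(th.text or "").strip().lower()
               for th in table.findall("./thead/tr/th")]
    records = []
    for tr in table.findall("./tbody/tr"):
        # Take <th> and <td> children in document order so the '#' column is kept.
        cells = [(cell.text or "").strip()
                 for cell in tr if cell.tag in ("th", "td")]
        records.append(dict(zip(headers, cells)))
    return records


table = ET.fromstring(TABLE_HTML.strip())
records = table_to_dicts(table)
for record in records:
    print(record)
```

Compared with the spider's output, each record here also carries the # column, since the <th> cell is included on purpose.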