A common way of presenting data on websites is with HTML tables, and Scrapy is perfect for the job of scraping them.
An HTML table starts with a <table> tag, with each row defined by a <tr> tag and each column by a <td> tag. Optionally, <thead> is used to group the header rows and <tbody> to group the content rows.
To scrape data from an HTML table, we basically need to find the table we're interested in on a page and, for each row, iterate over the columns we want to get our data from.
For this example we'll scrape Bootstrap's Tables documentation page.
In this case, the table is assigned the classes table and table-striped. Here's the actual HTML code for the table:
<table class="table table-striped">
  <thead>
    <tr>
      <th scope="col">#</th>
      <th scope="col">First</th>
      <th scope="col">Last</th>
      <th scope="col">Handle</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th scope="row">1</th>
      <td>Mark</td>
      <td>Otto</td>
      <td>@mdo</td>
    </tr>
    <tr>
      <th scope="row">2</th>
      <td>Jacob</td>
      <td>Thornton</td>
      <td>@fat</td>
    </tr>
    <tr>
      <th scope="row">3</th>
      <td>Larry</td>
      <td>the Bird</td>
      <td>@twitter</td>
    </tr>
  </tbody>
</table>
Start the Scrapy shell at the terminal with the web page URL as an argument.

$ scrapy shell https://getbootstrap.com/docs/4.0/content/tables/
2020-05-26 02:52:01 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: scrapybot)
2020-05-26 02:52:01 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 18.9.0, Python 3.8.2 (default, Apr 27 2020, 15:53:34) - [GCC 9.3.0], pyOpenSSL 19.0.0 (OpenSSL 1.1.1f 31 Mar 2020), cryptography 2.8, Platform Linux-5.4.0-31-generic-x86_64-with-glibc2.29
2020-05-26 02:52:01 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-05-26 02:52:01 [scrapy.crawler] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0}
##### snipped
In [1]: response
Out[1]: <200 https://getbootstrap.com/docs/4.0/content/tables/>
200 is the OK success response status code for HTTP.
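The status code is also available directly on the response object, so a quick check in the same shell looks like this:

response.status   # 200 for a successful fetch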
Select the table using an xpath selector.

In [2]: table = response.xpath('//*[@class="table table-striped"]')

In [3]: table
Out[3]: [<Selector xpath='//*[@class="table table-striped"]' data='<table class="table table-striped">\n ...'>]
In this case the table is assigned the table and table-striped CSS classes, and that's what we use as our selector.
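As an aside, the same table can be selected with Scrapy's CSS selectors; unlike the exact-string @class match above, class selectors don't depend on attribute order or extra classes. A minimal sketch (the variable name is just illustrative):

# Equivalent selection using a CSS selector instead of XPath
table = response.css('table.table.table-striped')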
Narrow the selection down to the tbody, if applicable.

In [4]: table = response.xpath('//*[@class="table table-striped"]//tbody')

In [5]: table
Out[5]: [<Selector xpath='//*[@class="table table-striped"]//tbody' data='<tbody>\n <tr>\n <th scope="row...'>]
Select the rows with tr.

In [6]: rows = table.xpath('//tr')

In [7]: rows
Out[7]:
[<Selector xpath='//tr' data='<tr>\n <th scope="col">#</th>\n ...'>,
 <Selector xpath='//tr' data='<tr>\n <th scope="row">1</th>\n ...'>,
 <Selector xpath='//tr' data='<tr>\n <th scope="row">2</th>\n ...'>,
 <Selector xpath='//tr' data='<tr>\n <th scope="row">3</th>\n ...'>,
 <Selector xpath='//tr' data='<tr>\n <th scope="col">#</th>\n ...'>,
##### snipped
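Note that //tr here is an absolute path, so it matches every tr on the whole page, including the header row and rows from the other tables on the page (which is why they show up in the output above). To limit the results to the selected tbody, prefix the expression with a dot; the rest of this walkthrough sticks with the //tr results as shown.

# Relative XPath: only the rows inside the selected tbody
rows = table.xpath('.//tr')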
Multiple rows are stored as a list, so we can pick one by index.

In [8]: row = rows[2]
Use the <td> selector to extract a column's data.

In [9]: row.xpath('td//text()')[0].extract()
Out[9]: 'Jacob'
The first column uses <th> instead of <td>, thus our array index starts at the First column of the table.
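For the same reason, dropping the index should give all of the row's <td> text nodes at once, for example:

# All <td> text nodes of the selected row in one call
row.xpath('td//text()').extract()   # expected: ['Jacob', 'Thornton', '@fat']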
Tie everything together with a for loop.

In [10]: for row in response.xpath('//*[@class="table table-striped"]//tbody//tr'):
    ...:     name = {
    ...:         'first' : row.xpath('td[1]//text()').extract_first(),
    ...:         'last': row.xpath('td[2]//text()').extract_first(),
    ...:         'handle' : row.xpath('td[3]//text()').extract_first(),
    ...:     }
    ...:     print(name)
    ...:
{'first': 'Mark', 'last': 'Otto', 'handle': '@mdo'}
{'first': 'Jacob', 'last': 'Thornton', 'handle': '@fat'}
{'first': 'Larry', 'last': 'the Bird', 'handle': '@twitter'}
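Note that recent Scrapy versions recommend get() and getall() in place of extract_first() and extract(); the older names still work as aliases. The same loop with the newer spelling would look like this:

for row in response.xpath('//*[@class="table table-striped"]//tbody//tr'):
    name = {
        'first': row.xpath('td[1]//text()').get(),
        'last': row.xpath('td[2]//text()').get(),
        'handle': row.xpath('td[3]//text()').get(),
    }
    print(name)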
Here's a complete Scrapy spider built from the previous code (optional). Note that allowed_domains expects bare domain names, not URLs.

import scrapy

class ScrapeTableSpider(scrapy.Spider):
    name = 'scrape-table'
    allowed_domains = ['getbootstrap.com']
    # start_requests() below takes precedence over start_urls
    start_urls = ['https://getbootstrap.com/docs/4.0/content/tables/']

    def start_requests(self):
        urls = [
            'https://getbootstrap.com/docs/4.0/content/tables',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for row in response.xpath('//*[@class="table table-striped"]//tbody/tr'):
            yield {
                'first' : row.xpath('td[1]//text()').extract_first(),
                'last': row.xpath('td[2]//text()').extract_first(),
                'handle' : row.xpath('td[3]//text()').extract_first(),
            }
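If you don't have a Scrapy project set up, you can also save the spider as a standalone file and run it with runspider (the filename is just an example):

$ scrapy runspider scrape_table.py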
Run the spider to get JSON output.

$ scrapy crawl --nolog --output -:json scrape-table
[
{"first": "Mark", "last": "Otto", "handle": "@mdo"},
{"first": "Jacob", "last": "Thornton", "handle": "@fat"},
{"first": "Larry", "last": "the Bird", "handle": "@twitter"}
]
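The output switch accepts other feed formats as well; assuming the format suffix works the same way as for JSON, CSV to stdout would be:

$ scrapy crawl --nolog --output -:csv scrape-table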