Scraping an HTML table with Scrapy turns rows that are easy to read in a browser into structured items that can be exported, filtered, or reused in later crawls. That works well for price lists, inventory pages, schedules, and other pages that publish data as a table instead of an API response.
Scrapy exposes the downloaded response through XPath and CSS selectors, which makes it practical to test one table selector in scrapy shell before the same extraction is moved into a spider. The cleanest flow is to identify one stable <table> element, confirm the header and row selectors, and only then map each cell into item fields.
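As a quick illustration, both selector styles below match the same rows once a scrapy shell session is open against the page (covered in the steps that follow); the id pricing is the one used by this guide's example table, so substitute whatever the target table actually exposes.

>>> response.xpath('//table[@id="pricing"]/tbody/tr')   # XPath form
>>> response.css('table#pricing tbody tr')              # equivalent CSS form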
The approach only works when the server response already contains the table markup. JavaScript-rendered tables, login-protected pages, anti-bot responses, and layouts that rely on rowspan, colspan, or irregular missing cells usually need a different endpoint or more explicit field handling than a simple fixed-column loop.
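One quick way to confirm that the markup is really in the server response, before writing any selectors, is to inspect the raw download from a scrapy shell session; the id below is again the example table's id.

>>> 'id="pricing"' in response.text   # False usually means the table is rendered by JavaScript
>>> view(response)                    # opens the HTML exactly as Scrapy received it in a browser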
Related: How to use CSS selectors in Scrapy
Related: How to export Scrapy items to CSV
Steps to scrape an HTML table with Scrapy:
- Open the target page and confirm the table already contains the rows and columns that should become exported items.

- Inspect the table element in browser developer tools and note a selector that stays stable across reloads.
Prefer an explicit id such as pricing when the table provides one, because it keeps the selector shorter and less likely to drift than a long class or ancestor match.
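Once the shell from the next step is running, the difference looks roughly like this; the pricing id comes from the example page, while the class and ancestor names are invented here to show a more fragile alternative.

>>> response.xpath('//table[@id="pricing"]')   # short and stable
>>> response.xpath('//div[@class="content-area"]//table[contains(@class, "price-grid")]')   # longer and tied to styling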
- Start scrapy shell against the target page and suppress crawler log noise while testing selectors.
$ scrapy shell 'https://catalog.example.com/pricing/' --nolog
[s] Available Scrapy objects:
[s]   response   <200 https://catalog.example.com/pricing/>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
##### snipped #####
>>>
Related: How to use Scrapy shell
- Check the header cells before mapping column positions into field names.
>>> response.xpath('//table[@id="pricing"]/thead/tr/th/text()').getall()
['Plan', 'Price']

A quick header check catches the wrong table or the wrong selector before the row loop is written.

Related: How to use XPath selectors in Scrapy
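When the column order is not guaranteed, one option is to derive the field names from those header cells instead of hardcoding them; a small sketch in the same shell session:

>>> headers = [h.strip().lower() for h in response.xpath('//table[@id="pricing"]/thead/tr/th/text()').getall()]
>>> headers
['plan', 'price']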
- Select the data rows into a variable and confirm how many rows matched.
>>> rows = response.xpath('//table[@id="pricing"]/tbody/tr')
>>> len(rows)
3

Selecting tbody/tr keeps the loop focused on data rows instead of mixing the header row into the item output.
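Not every table wraps its data rows in a tbody element. When the markup omits it, a row predicate such as tr[td] keeps header rows out of the match; a hedged alternative for that case:

>>> response.xpath('//table[@id="pricing"]//tr[td]')   # only rows that contain data cells, with or without a tbody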
- Extract the cells from each matched row with relative XPath expressions.
>>> for row in rows:
...     print({
...         "plan": row.xpath('normalize-space(./td[1])').get(),
...         "price": row.xpath('normalize-space(./td[2])').get(),
...     })
...
{'plan': 'Starter Plan', 'price': '$29'}
{'plan': 'Team Plan', 'price': '$79'}
{'plan': 'Enterprise Plan', 'price': '$199'}

Keep the field XPath relative to the current row. Starting a nested selector with // or / jumps back to the document root and can duplicate or misalign values.
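Combining this loop with the headers list from the header check gives a version that does not depend on column positions; a sketch, not necessary for a simple two-column table like this one:

>>> for row in rows:
...     cells = [td.xpath('normalize-space(.)').get() for td in row.xpath('./td')]
...     print(dict(zip(headers, cells)))
...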
- Save the verified table selector and row extraction in a spider file.
- table_spider.py
import scrapy


class PricingTableSpider(scrapy.Spider):
    name = "pricing_table"
    start_urls = ["https://catalog.example.com/pricing/"]

    def parse(self, response):
        for row in response.xpath('//table[@id="pricing"]/tbody/tr'):
            yield {
                "plan": row.xpath("normalize-space(./td[1])").get(),
                "price": row.xpath("normalize-space(./td[2])").get(),
            }
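For reference, the same parse method can be written with CSS selectors; this sketch is meant to be equivalent to the XPath version above, not a change in behaviour.

    def parse(self, response):
        for row in response.css("table#pricing tbody tr"):
            yield {
                "plan": row.css("td:nth-child(1)::text").get(default="").strip(),
                "price": row.css("td:nth-child(2)::text").get(default="").strip(),
            }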
Related: How to create a Scrapy spider
- Run the spider and overwrite the current JSON export with the latest table rows.
$ scrapy runspider table_spider.py -O table_rows.json
[scrapy.extensions.feedexport] INFO: Stored json feed (3 items) in: table_rows.json
[scrapy.core.engine] INFO: Spider closed (finished)
runspider works outside a full Scrapy project, while -O replaces the existing output file with the current crawl result.
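The same spider can feed other formats, since Scrapy infers the exporter from the file extension, and lowercase -o appends to an existing file instead of replacing it. The CSV filename below is only an example.

$ scrapy runspider table_spider.py -O table_rows.csv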
- Open the exported file and confirm that each table row became one JSON item.
$ cat table_rows.json
[
  {"plan": "Starter Plan", "price": "$29"},
  {"plan": "Team Plan", "price": "$79"},
  {"plan": "Enterprise Plan", "price": "$199"}
]
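For longer tables that are hard to eyeball, a plain Python session can verify the item count; the expected result here is the three rows shown above.

>>> import json
>>> with open("table_rows.json") as f:
...     items = json.load(f)
...
>>> len(items)
3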
