Scraping an HTML table with Scrapy turns a page that is easy to read in a browser into rows that can be exported, filtered, and loaded into later automation. That is useful when pricing, inventory, schedules, or status pages are published as tables instead of API responses.
Scrapy fetches the page and exposes the response body through XPath and CSS selectors. Table extraction usually means selecting one stable <table> element, limiting the row selection to <tbody><tr>, and reading each cell from the current row instead of querying the whole document repeatedly.
The extracted rows only exist when the server response already contains the table markup, so JavaScript-rendered tables, login-gated pages, and anti-bot responses can still leave the selector empty. Tables that use rowspan, colspan, nested links, or missing cells also need more careful field mapping than a simple fixed-column table.
Related: How to use CSS selectors in Scrapy
Related: How to export Scrapy items to CSV
Steps to scrape an HTML table with Scrapy:
- Open the target page and confirm the table contains the rows and columns that should become Scrapy items.

- Inspect the table element in browser developer tools and capture a selector that stays stable across page refreshes.
Prefer an explicit id such as pricing over a long class match when the table already exposes a unique identifier.
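Once the shell from the next step is running, a quick uniqueness check confirms the selector matches exactly one table; the count of 1 assumes the example pricing page used throughout this guide:
>>> len(response.xpath('//table[@id="pricing"]'))
1
>>> len(response.css('table#pricing'))
1
Both forms select the same element; the XPath form matches the expressions used in the remaining steps.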
- Start scrapy shell with the target page so the selector can be tested against the real response.
$ scrapy shell 'https://catalog.example.com/pricing/' --nolog
[s] Available Scrapy objects:
[s]   response   <200 https://catalog.example.com/pricing/>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
##### snipped #####
>>>
Related: How to use Scrapy shell
- Confirm the table headers and row count before extracting cell values.
>>> response.xpath('//table[@id="pricing"]/thead/tr/th/text()').getall()
['Plan', 'Price']
>>> rows = response.xpath('//table[@id="pricing"]/tbody/tr')
>>> len(rows)
3
Selecting tbody/tr skips the header row and keeps the extraction loop focused on data rows only.
- Extract the cells from each matched row with relative XPath expressions.
>>> for row in rows:
...     print({
...         "plan": row.xpath('normalize-space(./td[1])').get(),
...         "price": row.xpath('normalize-space(./td[2])').get(),
...     })
...
{'plan': 'Starter Plan', 'price': '$29'}
{'plan': 'Team Plan', 'price': '$79'}
{'plan': 'Enterprise Plan', 'price': '$199'}
Use relative XPath such as ./td[1] from the current row selector so each field comes from that row instead of restarting at the document root.
Leading // or / inside a nested selector turns the XPath back into a document-wide query and can duplicate or misalign table values.
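A quick shell comparison makes the pitfall concrete. Using the rows selected above (and assuming the cells hold plain text, as in the example output), a leading // jumps back to the document root and returns the first matching cell in the whole page, while ./ stays inside the current row:
>>> rows[2].xpath('//td[1]/text()').get()
'Starter Plan'
>>> rows[2].xpath('./td[1]/text()').get()
'Enterprise Plan'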
Related: How to use XPath selectors in Scrapy
- Save the validated selector in a standalone spider file.
- table_spider.py
import scrapy


class PricingTableSpider(scrapy.Spider):
    name = "pricing_table"
    start_urls = ["https://catalog.example.com/pricing/"]

    def parse(self, response):
        for row in response.xpath('//table[@id="pricing"]/tbody/tr'):
            yield {
                "plan": row.xpath("normalize-space(./td[1])").get(),
                "price": row.xpath("normalize-space(./td[2])").get(),
            }
See How to create a Scrapy spider when the table scraper should live inside a regular Scrapy project instead of a one-file standalone spider.
- Run the spider and overwrite the current JSON export with the latest table rows.
$ scrapy runspider table_spider.py -O table_rows.json
##### snipped #####
2026-04-16 06:12:47 [scrapy.extensions.feedexport] INFO: Stored json feed (3 items) in: table_rows.json
2026-04-16 06:12:47 [scrapy.core.engine] INFO: Spider closed (finished)
runspider is a global Scrapy command, so the same verification flow works outside a full project while the selector is still being tested.
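If the spider is later moved into a Scrapy project, the project-local crawl command produces the same feed; a minimal equivalent, assuming the spider file sits in the project's spiders/ directory:
$ scrapy crawl pricing_table -O table_rows.json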
- Open the exported file and confirm that each table row became one item.
$ cat table_rows.json
[
{"plan": "Starter Plan", "price": "$29"},
{"plan": "Team Plan", "price": "$79"},
{"plan": "Enterprise Plan", "price": "$199"}
]
Notes
- Use normalize-space(./td[n]) when cells may contain extra spaces, line breaks, or nested markup that should collapse into one string.
- Header-driven tables can also be mapped dynamically by extracting <th> labels first and then zipping them with the row cells when column order changes between pages; a sketch of this approach follows these notes.
- Tables rendered only after JavaScript runs usually need the backing API, a different non-JavaScript endpoint, or a browser-rendering workflow instead of a plain Scrapy request.
- Follow How to export Scrapy items to CSV, or switch to another feed format, when the rows need to move into spreadsheets or downstream data pipelines after extraction.
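For the header-driven mapping mentioned above, here is a minimal sketch that builds item keys from the <th> labels instead of hard-coding column positions. The spider name and class are hypothetical; the table structure is assumed to match the pricing example used throughout this guide.
import scrapy


class HeaderMappedTableSpider(scrapy.Spider):
    name = "header_mapped_table"
    start_urls = ["https://catalog.example.com/pricing/"]

    def parse(self, response):
        table = response.xpath('//table[@id="pricing"]')
        # Header labels become the item keys, so the extraction no longer
        # depends on a fixed column order.
        headers = table.xpath('./thead/tr/th/text()').getall()
        for row in table.xpath('./tbody/tr'):
            # normalize-space(.) collapses whitespace and nested markup
            # inside each cell into one clean string.
            cells = row.xpath('./td').xpath('normalize-space(.)').getall()
            yield dict(zip(headers, cells))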
