CSV downloads often contain the same data shown on a web page, but in a format that is already structured and ready for processing. Pulling the .csv file directly avoids brittle HTML selectors and preserves clean columns for analysis, import, or automation.
Scrapy can request the CSV URL like any other endpoint and expose the response as text through response.text. The Python csv module can then parse rows into dictionaries, allowing each row to be emitted as a Scrapy item for storage, pipelines, or additional crawling.
Many CSV endpoints are protected by authentication, anti-bot rate limits, or short-lived download tokens, and some servers return compressed or non-UTF-8 content. Scrapy buffers the entire CSV response in memory before parsing, so keep file size in mind and confirm that the target site permits automated access.
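When the server uses a non-UTF-8 encoding, response.text may decode incorrectly. A minimal sketch of a manual workaround, where cp1252 is only an assumed placeholder for whatever encoding the server actually sends:

# Sketch: decode the raw bytes yourself instead of relying on response.text.
# cp1252 is an assumed encoding; replace it with the server's real one.
text = response.body.decode("cp1252", errors="replace")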
Related: How to scrape an XML file with Scrapy
Related: How to export Scrapy items to CSV
Steps to scrape a CSV file with Scrapy:
- Open Scrapy shell for the CSV download URL.
$ scrapy shell http://files.example.net:8000/data/products.csv
2026-01-01 09:03:28 [scrapy.utils.log] INFO: Scrapy 2.11.1 started (bot: simplifiedguide)
##### snipped #####
Replace the URL with the direct .csv download endpoint.
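If the endpoint requires a specific User-Agent or cookie, the shell's fetch() helper also accepts a full Request object; the header value below is a placeholder, not something the example endpoint is known to require.

>>> from scrapy import Request
>>> fetch(Request("http://files.example.net:8000/data/products.csv", headers={"User-Agent": "my-bot/1.0"}))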
- Confirm the CSV URL returns a successful response.
>>> response
<200 http://files.example.net:8000/data/products.csv>
>>> response.headers.get(b"Content-Type")
b'text/csv'
A 3xx status usually indicates a redirect to a signed download URL; Scrapy follows it automatically, and response.url shows the final location.
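A quick check in the same shell session confirms both the status and the final URL:

>>> response.status
200
>>> response.url
'http://files.example.net:8000/data/products.csv'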
- Preview the header row and the first few records.
>>> response.text.splitlines()[:4]
['name,price,url', 'Starter Plan,$29,http://app.internal.example:8000/products/starter-plan.html', 'Team Plan,$79,http://app.internal.example:8000/products/team-plan.html', 'Enterprise Plan,$199,http://app.internal.example:8000/products/enterprise-plan.html']
Semicolon-delimited files require passing a custom delimiter, such as delimiter=';', to the CSV reader.
- Test CSV parsing in the shell using Python's csv.DictReader.
>>> import csv, io
>>> reader = csv.DictReader(io.StringIO(response.text.lstrip("\ufeff")))
>>> next(reader)
{'name': 'Starter Plan', 'price': '$29', 'url': 'http://app.internal.example:8000/products/starter-plan.html'}
CSV without a header row requires column names via the fieldnames argument.
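A short sketch of those DictReader options, assuming the column names match the sample data above:

>>> # fieldnames supplies column names when the file has no header row.
>>> reader = csv.DictReader(io.StringIO(response.text), fieldnames=["name", "price", "url"])
>>> # delimiter switches to semicolon-separated input.
>>> reader = csv.DictReader(io.StringIO(response.text), delimiter=";")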
- Create a stand-alone spider file that yields one item per CSV row.
- scrape_csv.py
import csv
import io

import scrapy


class ScrapeCsvSpider(scrapy.Spider):
    """Scrape a remote CSV file and emit each row as an item."""

    name = "scrape-csv"
    start_urls = ["http://files.example.net:8000/data/products.csv"]

    def parse(self, response):
        """Parse the CSV response and yield a dictionary for each row."""
        text = response.text.lstrip("\ufeff")
        try:
            dialect = csv.Sniffer().sniff(text[:2048])
        except csv.Error:
            dialect = csv.excel
        reader = csv.DictReader(io.StringIO(text), dialect=dialect)
        for row in reader:
            if not any(row.values()):
                continue
            clean_row = {}
            for key, value in row.items():
                if key is None:
                    continue
                clean_row[key.strip()] = value.strip() if isinstance(value, str) else value
            yield clean_row
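The introduction mentioned using rows for additional crawling. A minimal sketch of that idea, assuming each row's url column points to a page handled by a hypothetical parse_product callback you would define on the spider:

    def parse(self, response):
        """Sketch only: follow each row's URL instead of yielding the row itself."""
        text = response.text.lstrip("\ufeff")
        for row in csv.DictReader(io.StringIO(text)):
            if row.get("url"):
                # parse_product is a hypothetical callback; cb_kwargs passes the row along.
                yield response.follow(row["url"], callback=self.parse_product, cb_kwargs={"row": row})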
Related: How to create a Scrapy spider
- Run the spider file using Scrapy.
$ scrapy runspider --nolog --output -:json scrape_csv.py
[
{"name": "Starter Plan", "price": "$29", "url": "http://app.internal.example:8000/products/starter-plan.html"},
{"name": "Team Plan", "price": "$79", "url": "http://app.internal.example:8000/products/team-plan.html"},
{"name": "Enterprise Plan", "price": "$199", "url": "http://app.internal.example:8000/products/enterprise-plan.html"}
]
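To write the items to a file instead of standard output, pass a path to --output, or -O to overwrite an existing file; Scrapy infers the format from the extension.

$ scrapy runspider --nolog scrape_csv.py -O products.json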
