CSV download endpoints often expose the same records that power tables, catalog exports, or reporting pages, so targeting the CSV directly keeps the crawl on a stable, row-based source instead of on brittle HTML selectors.

Scrapy includes CSVFeedSpider for this exact pattern. It fetches the CSV URL like any other request, reads the header row into field names, and calls parse_row() once per data row so the spider can yield normal Scrapy items for feed export, pipelines, or follow-up requests.

The CSV URL still needs to return the file body itself, and non-default formats may need explicit delimiter, quotechar, or headers values. Very large CSV responses are still received as one response body, so confirm the direct download path first and keep memory use in mind before pointing the spider at multi-gigabyte exports.
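
If an export really is too large to receive as a single response body, one fallback is a plain streaming pass outside Scrapy. The following is a minimal sketch, not part of the CSVFeedSpider workflow, assuming the same products.csv URL used in the steps below and relying only on the standard library:

    import csv
    import io
    import urllib.request

    # Hypothetical URL, reused from the example below.
    URL = "https://files.example.net/data/products.csv"

    with urllib.request.urlopen(URL) as raw:
        # TextIOWrapper decodes the byte stream lazily, so the file is
        # never held in memory as one body.
        text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
        for row in csv.DictReader(text):
            # Handle one row at a time here.
            print(row["sku"], row["price"])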

Steps to scrape a CSV file with Scrapy using CSVFeedSpider:

  1. Check the direct CSV URL in Scrapy shell and confirm the response type plus the first few rows.
    $ scrapy shell --nolog https://files.example.net/data/products.csv -c '(response.status, response.headers.get(b"Content-Type"), response.text.splitlines()[:3])'
    (200, b'text/csv', ['sku,name,price,url', 'starter-001,Starter Plan,$29,https://shop.example.com/products/starter-plan.html', 'team-001,Team Plan,$79,https://shop.example.com/products/team-plan.html'])

    The response should be the CSV body itself, not an HTML landing page, login form, or expiring redirect target.

  2. Create a standalone CSVFeedSpider file with the CSV URL and one parse_row() callback.
    $ vi csv_feed_spider.py
    from scrapy.spiders import CSVFeedSpider
     
     
    class ProductCsvSpider(CSVFeedSpider):
        name = "product_csv"
        start_urls = ["https://files.example.net/data/products.csv"]
        delimiter = ","
        quotechar = '"'
     
        def parse_row(self, response, row):
            yield {
                "sku": row["sku"],
                "name": row["name"],
                "price": row["price"],
                "url": row["url"],
            }

    CSVFeedSpider uses the first row as headers by default. Add headers = ["sku", "name", "price", "url"] when the file has no header row, and change delimiter or quotechar when the feed uses a different CSV dialect, as in the sketch below.
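
    As a sketch of those overrides, here is a hypothetical variant for a headerless, semicolon-delimited export; the products-noheader.csv URL is an assumption, not part of the example feed above:

    from scrapy.spiders import CSVFeedSpider


    class HeaderlessCsvSpider(CSVFeedSpider):
        name = "headerless_csv"
        # Hypothetical export with no header row and ; as the delimiter.
        start_urls = ["https://files.example.net/data/products-noheader.csv"]
        delimiter = ";"
        quotechar = '"'
        # With headers set explicitly, every row is treated as data.
        headers = ["sku", "name", "price", "url"]

        def parse_row(self, response, row):
            # row is still a dict keyed by the names in headers.
            yield dict(row)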

  3. Run the spider and overwrite the export file for the current crawl.
    $ scrapy runspider csv_feed_spider.py -O products.jsonl
    2026-04-16 05:42:54 [scrapy.utils.log] INFO: Scrapy 2.15.0 started (bot: scrapybot)
    ##### snipped #####
    2026-04-16 05:42:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://files.example.net/data/products.csv> (referer: None)
    2026-04-16 05:42:54 [scrapy.extensions.feedexport] INFO: Stored jsonl feed (3 items) in: products.jsonl
    2026-04-16 05:42:54 [scrapy.core.engine] INFO: Spider closed (finished)

    The .jsonl suffix selects the JSON Lines exporter automatically, and -O replaces any existing products.jsonl file, while a lowercase -o would append to it instead.
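
    Other suffixes select other built-in exporters the same way; for example, a .csv suffix (with an arbitrary output name here) would re-export the items as CSV:
    $ scrapy runspider csv_feed_spider.py -O products-out.csv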

    Running runspider from inside an existing Scrapy project can pull in that project's settings, middleware, and pipelines instead of a neutral standalone configuration; running the file from a directory tree without a scrapy.cfg keeps the standalone defaults.
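
    When a specific setting matters, pinning it on the command line with -s keeps the run independent of ambient project configuration; the user agent string here is only a placeholder:
    $ scrapy runspider csv_feed_spider.py -s USER_AGENT="example-bot/1.0" -O products.jsonl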

  4. Read the exported rows to confirm that each CSV row became one Scrapy item.
    $ cat products.jsonl
    {"sku": "starter-001", "name": "Starter Plan", "price": "$29", "url": "https://shop.example.com/products/starter-plan.html"}
    {"sku": "team-001", "name": "Team Plan", "price": "$79", "url": "https://shop.example.com/products/team-plan.html"}
    {"sku": "growth-001", "name": "Growth Plan", "price": "$129", "url": "https://shop.example.com/products/growth-plan.html"}

    Each line should contain one parsed item, which makes .jsonl easy to inspect, diff, or stream into later processing.
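
    As one sketch of that later processing, a short standard-library script can consume the export line by line; it assumes the products.jsonl file produced above:

    import json

    # Read the export produced above: one JSON object per line.
    with open("products.jsonl", encoding="utf-8") as feed:
        for line in feed:
            item = json.loads(line)
            print(item["sku"], "->", item["price"])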