Scraping a direct CSV file with Scrapy builds the crawl on a row-based source instead of fragile HTML selectors, an approach that suits catalog exports, reporting downloads, and partner feeds that already expose their records as a file.

Current Scrapy still includes CSVFeedSpider for this pattern. It downloads the CSV as one response, uses the first row as headers unless headers is set explicitly, and calls parse_row() once per data row so the spider can yield normal items for feed export, pipelines, or follow-up requests.

The target URL still needs to return the raw CSV body instead of an HTML landing page, login form, or redirect chain, and large CSV downloads are still loaded into one response before row parsing starts. If the feed uses a different CSV dialect, set delimiter or quotechar explicitly, and set headers when the file has no header row, as in the sketch below; malformed rows with the wrong column count are skipped after a warning.
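
For feeds that do not use the default dialect, the overrides live as class attributes on the spider. A minimal sketch, assuming a hypothetical semicolon-delimited partner export with no header row (the URL and field names are placeholders):

    from scrapy.spiders import CSVFeedSpider


    class PartnerFeedSpider(CSVFeedSpider):
        name = "partner_feed"
        start_urls = ["https://files.example.net/data/partner-feed.csv"]  # placeholder feed URL
        delimiter = ";"                      # fields separated by semicolons instead of commas
        quotechar = "'"                      # fields quoted with single quotes
        headers = ["sku", "name", "price"]   # supplied because the file ships without a header row

        def parse_row(self, response, row):
            yield row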

Steps to scrape a CSV file with Scrapy using CSVFeedSpider:

  1. Check the direct CSV URL in Scrapy shell and confirm the header row plus the first records.
    $ scrapy shell --nolog https://files.example.net/data/products.csv -c 'response.text.splitlines()[:3]'
    ['sku,name,price,url', 'starter-001,Starter Plan,$29,https://shop.example.net/products/starter-plan', 'team-001,Team Plan,$79,https://shop.example.net/products/team-plan']

    The returned text should be CSV lines from the file itself, not an HTML download page or login response.
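
    If the output is ambiguous, a few extra lines run inside an interactive shell session (scrapy shell https://files.example.net/data/products.csv without -c) can confirm the content type and field names; the exact Content-Type value depends on the server, so treat the comments below as expectations rather than guarantees.

    from csv import DictReader
    from io import StringIO

    print(response.headers.get("Content-Type"))   # an HTML landing page would report text/html here
    reader = DictReader(StringIO(response.text))
    print(reader.fieldnames)                      # should match the header row: ['sku', 'name', 'price', 'url']
    print(next(reader))                           # first data row parsed into a dict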

  2. Save a standalone CSVFeedSpider file with the CSV URL and one parse_row() callback.
    $ vi csv_feed_spider.py
    csv_feed_spider.py
    from scrapy.spiders import CSVFeedSpider
     
     
    class ProductCsvSpider(CSVFeedSpider):
        name = "product_csv"
        start_urls = ["https://files.example.net/data/products.csv"]
     
        def parse_row(self, response, row):
            # Called once per data row; row maps the header names to that row's values.
            yield {
                "sku": row["sku"],
                "name": row["name"],
                "price": row["price"],
                "url": row["url"],
            }

    CSVFeedSpider uses the first row as headers by default. Add headers = ["sku", "name", "price", "url"] when the file has no header row, and add delimiter or quotechar only when the feed does not use the normal comma-plus-double-quote CSV format.
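
    Because parse_row() output goes through the normal spider machinery, a row can also trigger a follow-up request, for example to the product page named in the url column. A sketch of that extension inside ProductCsvSpider, assuming the detail pages are crawlable and using a hypothetical parse_product() callback whose selector depends on the real page markup:

        def parse_row(self, response, row):
            yield {
                "sku": row["sku"],
                "name": row["name"],
                "price": row["price"],
                "url": row["url"],
            }
            # Also fetch the product page listed in the row; cb_kwargs carries the sku along.
            yield response.follow(row["url"], callback=self.parse_product, cb_kwargs={"sku": row["sku"]})

        def parse_product(self, response, sku):
            # Hypothetical detail-page callback; adjust the selector to the actual page.
            yield {"sku": sku, "title": response.css("h1::text").get()}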

  3. Run the spider and overwrite the export file for the current crawl.
    $ scrapy runspider csv_feed_spider.py -O products.jsonl
    2026-04-16 05:42:54 [scrapy.utils.log] INFO: Scrapy 2.15.0 started (bot: scrapybot)
    ##### snipped #####
    2026-04-16 05:42:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://files.example.net/data/products.csv> (referer: None)
    2026-04-16 05:42:54 [scrapy.extensions.feedexport] INFO: Stored jsonl feed (3 items) in: products.jsonl
    2026-04-16 05:42:54 [scrapy.core.engine] INFO: Spider closed (finished)

    The .jsonl suffix selects JSON Lines export automatically, and -O replaces any existing products.jsonl file.

    Running runspider from inside an existing Scrapy project can pull in that project's settings, middleware, and pipelines instead of a neutral standalone configuration.
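
    When that interference matters, one alternative is a small standalone script that sets the feed explicitly and runs the spider with CrawlerProcess; this is a sketch, with the FEEDS entry mirroring -O products.jsonl and the import assuming the script sits next to csv_feed_spider.py.

    from scrapy.crawler import CrawlerProcess

    from csv_feed_spider import ProductCsvSpider

    # Explicit settings keep the run independent of any surrounding project configuration.
    process = CrawlerProcess(settings={
        "FEEDS": {"products.jsonl": {"format": "jsonlines", "overwrite": True}},
    })
    process.crawl(ProductCsvSpider)
    process.start()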

  4. Read the exported rows and confirm that each CSV row became one Scrapy item.
    $ cat products.jsonl
    {"sku": "starter-001", "name": "Starter Plan", "price": "$29", "url": "https://shop.example.net/products/starter-plan"}
    {"sku": "team-001", "name": "Team Plan", "price": "$79", "url": "https://shop.example.net/products/team-plan"}
    {"sku": "growth-001", "name": "Growth Plan", "price": "$129", "url": "https://shop.example.net/products/growth-plan"}

    Each line should contain one parsed item, which keeps the export easy to inspect, diff, or hand off to later processing.
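
    For a scripted version of the same check, a few lines of Python can assert that every exported line carries exactly the fields yielded by parse_row() above:

    import json

    with open("products.jsonl", encoding="utf-8") as feed:
        for line in feed:
            item = json.loads(line)
            # Each JSON Lines record should hold the four fields from the spider.
            assert set(item) == {"sku", "name", "price", "url"}, item
    print("all rows exported as complete items")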