Many datasets are published as downloadable attachments—PDF reports, spreadsheets, images—linked from an index page or listing. Downloading those files during a Scrapy crawl keeps each scraped item complete and avoids gaps caused by missing attachments.
Scrapy downloads files through the built-in FilesPipeline, which reads a list of URLs from an item field named file_urls by default. Each successful download is saved under the FILES_STORE location and a corresponding entry is added to the item’s files field, including the stored path and a content checksum.
File downloads increase bandwidth and disk I/O, so high concurrency can bottleneck a crawl or fill a filesystem unexpectedly. Some sites gate downloads behind cookies, short-lived signed URLs, or anti-bot checks, so file requests may need the same session context as the page request. Keep FILES_STORE outside the project directory (or exclude it from version control) to avoid committing large binaries.
Related: How to enable item pipelines in Scrapy
Related: How to set a download delay in Scrapy
$ vi downloads_demo/spiders/reports.py
import scrapy


class ReportsSpider(scrapy.Spider):
    name = "reports"
    start_urls = ["http://downloads.example.net:8000/downloads/"]

    def parse(self, response):
        for report in response.css("a"):
            href = report.css("::attr(href)").get()
            if href:
                yield {
                    "title": report.css("::text").get(),
                    "file_urls": [response.urljoin(href)],
                }
FilesPipeline expects a list in file_urls even for a single file URL.
ITEM_PIPELINES = {
    "scrapy.pipelines.files.FilesPipeline": 1,
}
FILES_STORE = "/root/sg-work/files-store"
Large crawls can fill the filesystem under FILES_STORE; choose a path on a volume with enough free space.
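Two built-in settings help keep the store and the downloads themselves under control: FILES_EXPIRES makes FilesPipeline skip files it already downloaded within the given number of days (90 by default), and DOWNLOAD_MAXSIZE aborts responses above a byte limit. The values below are illustrative, not recommendations:

```python
# settings.py (additions) — example values, tune for your crawl
FILES_EXPIRES = 30                    # skip files fetched within the last 30 days
DOWNLOAD_MAXSIZE = 200 * 1024 * 1024  # abort any download larger than 200 MB
```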
On Windows 11, set FILES_STORE to a path like C:\scrapy\files.
$ scrapy crawl reports -O reports.jl
2026-01-01 09:48:50 [scrapy.pipelines.files] DEBUG: File (downloaded): Downloaded file from <GET http://downloads.example.net:8000/downloads/report-2026-01.txt> referred in <None>
2026-01-01 09:48:50 [scrapy.extensions.feedexport] INFO: Stored jl feed (1 items) in: reports.jl
$ head -n 1 reports.jl
{"title": "Quarterly Report", "file_urls": ["http://downloads.example.net:8000/downloads/report-2026-01.txt"], "files": [{"url": "http://downloads.example.net:8000/downloads/report-2026-01.txt", "path": "full/89b226624cca0340bd900a6bb8c3f6080f4eebe5.txt", "checksum": "4caf9e6249dc2f4d5eae15e4b53f656c", "status": "downloaded"}]}
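The path in each files entry is relative to FILES_STORE, so post-processing scripts can join the two to locate a download on disk. A minimal sketch, assuming the store path configured above and a helper name of our own choosing:

```python
import json
from pathlib import Path

# Must match FILES_STORE in settings.py.
FILES_STORE = Path("/root/sg-work/files-store")


def stored_paths(jl_line: str) -> list[Path]:
    """Map one exported .jl item line to absolute paths under FILES_STORE."""
    item = json.loads(jl_line)
    return [FILES_STORE / entry["path"] for entry in item.get("files", [])]
```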
$ ls -1 /root/sg-work/files-store/full | head -n 2
89b226624cca0340bd900a6bb8c3f6080f4eebe5.txt
##### snipped #####
The stored filename is a hash by default; override file_path() in a custom pipeline to preserve original filenames.