Many datasets are published as downloadable attachments—PDF reports, spreadsheets, images—linked from an index page or listing. Downloading those files during a Scrapy crawl keeps each scraped item complete and avoids gaps caused by missing attachments.
Scrapy downloads files through the built-in FilesPipeline, which reads a list of URLs from an item field named file_urls by default. Each successful download is saved under the FILES_STORE location and a corresponding entry is added to the item’s files field, including the stored path and a content checksum.
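For orientation, the sketch below shows the general shape of an item before and after the pipeline runs; the URL, title, hash, and checksum values are illustrative placeholders, not real output.

# Shape of an item before FilesPipeline runs: only file_urls is required.
scraped_item = {
    "title": "Quarterly Report",
    "file_urls": ["https://example.org/report.pdf"],
}

# Shape after FilesPipeline runs: a files entry is added for each download,
# with the storage path (relative to FILES_STORE) and a content checksum.
processed_item = {
    "title": "Quarterly Report",
    "file_urls": ["https://example.org/report.pdf"],
    "files": [
        {
            "url": "https://example.org/report.pdf",
            "path": "full/<hash-of-url>.pdf",
            "checksum": "<checksum-of-file-body>",
            "status": "downloaded",
        }
    ],
}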
File downloads increase bandwidth and disk I/O, so high concurrency can bottleneck a crawl or fill a filesystem unexpectedly. Some sites gate downloads behind cookies, short-lived signed URLs, or anti-bot checks, so file requests may need the same session context as the page request. Keep FILES_STORE outside the project directory (or exclude it from version control) to avoid committing large binaries.
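If bandwidth or server load is a concern, the crawl can be throttled through Scrapy's standard concurrency settings in settings.py; the values below are a sketch with illustrative starting points, not recommendations.

# settings.py (sketch): throttle the crawl so file downloads do not
# overwhelm the target site or local disk; tune values per site.
CONCURRENT_REQUESTS = 8             # global cap on in-flight requests
CONCURRENT_REQUESTS_PER_DOMAIN = 4  # per-domain cap; file requests use the same downloader
DOWNLOAD_DELAY = 0.5                # seconds to wait between requests to a domain
DOWNLOAD_TIMEOUT = 180              # raise this if large files time out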
Related: How to enable item pipelines in Scrapy
Related: How to set a download delay in Scrapy
Steps to download files with Scrapy:
- Open the spider file that extracts the download URL.
$ vi downloads_demo/spiders/reports.py
- Yield items that include a file_urls list of absolute file URLs.
import scrapy


class ReportsSpider(scrapy.Spider):
    name = "reports"
    start_urls = ["http://downloads.example.net:8000/downloads/"]

    def parse(self, response):
        for report in response.css("a"):
            href = report.css("::attr(href)").get()
            if href:
                yield {
                    "title": report.css("::text").get(),
                    "file_urls": [response.urljoin(href)],
                }
FilesPipeline expects a list in file_urls even for a single file URL.
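If the spider yields Item objects instead of plain dicts, the item class must declare both fields the pipeline uses; a minimal sketch with a hypothetical ReportItem class:

import scrapy


class ReportItem(scrapy.Item):
    title = scrapy.Field()
    file_urls = scrapy.Field()  # input: URLs for FilesPipeline to fetch
    files = scrapy.Field()      # output: populated by FilesPipeline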
- Configure FilesPipeline storage settings in settings.py.
ITEM_PIPELINES = {
    "scrapy.pipelines.files.FilesPipeline": 1,
}

FILES_STORE = "/root/sg-work/files-store"
Large crawls can fill the filesystem under FILES_STORE; choose a path on a volume with enough free space.
On Windows 11, set FILES_STORE to a path like C:\scrapy\files.
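Two related storage settings are worth knowing about, sketched below: FILES_EXPIRES controls how long a previously downloaded file is considered fresh before it is re-downloaded, and FILES_STORE also accepts remote URIs such as an S3 bucket when the required client library is installed.

# settings.py (sketch): optional FilesPipeline storage tuning.
FILES_EXPIRES = 90                        # days before a stored file is re-downloaded
# FILES_STORE = "s3://my-bucket/files/"   # remote storage alternative (requires botocore)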
- Run the spider with feed export enabled to save the files metadata.
$ scrapy crawl reports -O reports.jl
2026-01-01 09:48:50 [scrapy.pipelines.files] DEBUG: File (downloaded): Downloaded file from <GET http://downloads.example.net:8000/downloads/report-2026-01.txt> referred in <None>
2026-01-01 09:48:50 [scrapy.extensions.feedexport] INFO: Stored jl feed (1 items) in: reports.jl
- Inspect the exported items to confirm a files entry was recorded for each download.
$ head -n 1 reports.jl
{"title": "Quarterly Report", "file_urls": ["http://downloads.example.net:8000/downloads/report-2026-01.txt"], "files": [{"url": "http://downloads.example.net:8000/downloads/report-2026-01.txt", "path": "full/89b226624cca0340bd900a6bb8c3f6080f4eebe5.txt", "checksum": "4caf9e6249dc2f4d5eae15e4b53f656c", "status": "downloaded"}]}
- List the FILES_STORE directory to confirm the downloaded file exists on disk.
$ ls -1 /root/sg-work/files-store/full | head -n 2
89b226624cca0340bd900a6bb8c3f6080f4eebe5.txt
##### snipped #####
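As a final check, a short script can cross-reference the exported feed against FILES_STORE; this is a sketch that assumes the file and store paths used in the steps above.

# verify_downloads.py (sketch): confirm every file recorded in the feed
# actually exists on disk under FILES_STORE.
import json
import os

FILES_STORE = "/root/sg-work/files-store"

with open("reports.jl") as feed:
    for line in feed:
        item = json.loads(line)
        for entry in item.get("files", []):
            stored = os.path.join(FILES_STORE, entry["path"])
            print(entry["status"], entry["path"], "exists:", os.path.exists(stored))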
The stored filename is a hash by default; override file_path() in a custom pipeline to preserve original filenames.
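A minimal sketch of such a pipeline is shown below; the NamedFilesPipeline class name and module path are illustrative, and note that two URLs ending in the same filename would overwrite each other with this scheme.

# downloads_demo/pipelines.py (sketch): keep the original filename from the
# download URL instead of the default hash-based name.
from pathlib import PurePosixPath
from urllib.parse import urlparse

from scrapy.pipelines.files import FilesPipeline


class NamedFilesPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        # Use the last path segment of the URL, e.g. report-2026-01.txt.
        return "full/" + PurePosixPath(urlparse(request.url).path).name

Register the subclass in ITEM_PIPELINES (for example "downloads_demo.pipelines.NamedFilesPipeline": 1) in place of the stock FilesPipeline so it handles the downloads.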
