Downloading files during a Scrapy crawl keeps each scraped item tied to its original attachment instead of leaving PDFs, CSV exports, archives, or text reports behind on the remote site. That matters when the attached file is part of the dataset, audit trail, or offline processing pipeline the crawl is supposed to capture.
Scrapy handles this through FilesPipeline. When a spider yields an item with a file_urls list, the pipeline schedules those URLs through the normal downloader, stores each file under FILES_STORE, and writes a files result list back into the item with the original URL, stored path, checksum, and status. Current Scrapy releases still use file_urls and files as the default field names, and the default stored filename remains a URL-hash path under full/.
Relative download links must be converted with response.urljoin() before FilesPipeline can request them. The pipeline stays disabled when FILES_STORE is not set, an unwritable store path makes downloads fail when the file is written, and redirected attachment URLs need MEDIA_ALLOW_REDIRECTS = True because media redirects are treated as failed downloads by default. Keep the file store out of version control because it contains the downloaded payloads, not just crawl metadata.
Related: How to enable item pipelines in Scrapy
Related: How to set a download delay in Scrapy
Steps to download files with Scrapy FilesPipeline:
- Change to the Scrapy project directory that contains scrapy.cfg.
$ cd /srv/scrapy/downloads_demo
- Open the project settings file.
$ vi downloads_demo/settings.py
- Enable FilesPipeline and set a writable FILES_STORE path.
ITEM_PIPELINES = {
    "scrapy.pipelines.files.FilesPipeline": 1,
}

FILES_STORE = "files-store"
FilesPipeline reads download URLs from file_urls and writes the per-file results to files unless those field names are customized.
If FILES_STORE is empty, unwritable, or points to the wrong filesystem, the crawl can finish without saving the attachments that the item depends on.
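Projects that already use other item field names can remap the defaults with the FILES_URLS_FIELD and FILES_RESULT_FIELD settings. A minimal sketch, with report_urls and report_files as made-up field names that are not used elsewhere in this guide:
# Optional: point FilesPipeline at custom item fields instead of file_urls/files.
FILES_URLS_FIELD = "report_urls"
FILES_RESULT_FIELD = "report_files"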
- Open the spider file that extracts the download link.
$ vi downloads_demo/spiders/reports.py
- Yield each attachment as an item with a file_urls list built from absolute file URLs.
import scrapy


class ReportsSpider(scrapy.Spider):
    name = "reports"
    start_urls = ["http://downloads.example.net/"]

    def parse(self, response):
        for report in response.css("a"):
            href = report.css("::attr(href)").get()
            if href:
                yield {
                    "title": report.css("::text").get(),
                    "file_urls": [response.urljoin(href)],
                }
response.urljoin() converts a relative attachment path such as /downloads/report-2026-01.txt into a full URL that Scrapy can request.
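The spider above yields plain dicts, which need no declaration, but a project that yields Item objects must declare both pipeline fields on the item class. A minimal sketch, with ReportItem as an illustrative class name:
import scrapy


class ReportItem(scrapy.Item):
    # FilesPipeline reads file_urls and fills in files, so both fields must exist.
    title = scrapy.Field()
    file_urls = scrapy.Field()
    files = scrapy.Field()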
- Run the spider and export the items so the files metadata is saved with each record.
$ scrapy crawl reports -O reports.jl
2026-04-22 07:25:08 [scrapy.utils.log] INFO: Scrapy 2.15.0 started (bot: downloads_demo)
##### snipped #####
2026-04-22 07:25:10 [scrapy.middleware] INFO: Enabled item pipelines: ['scrapy.pipelines.files.FilesPipeline']
2026-04-22 07:25:13 [scrapy.pipelines.files] DEBUG: File (downloaded): Downloaded file from <GET http://downloads.example.net/downloads/report-2026-01.txt> referred in <None>
2026-04-22 07:25:13 [scrapy.extensions.feedexport] INFO: Stored jl feed (1 items) in: reports.jl
If the file URL first returns a 301 or 302 redirect, add MEDIA_ALLOW_REDIRECTS = True to settings.py before this run; otherwise the redirected media request is treated as a failed download.
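A minimal settings.py addition for that case:
# Follow 301/302 redirects on media requests instead of treating them as failures.
MEDIA_ALLOW_REDIRECTS = True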
- Open the exported feed to confirm Scrapy wrote the stored file metadata back into the item.
$ cat reports.jl
{"title": "Quarterly Report", "file_urls": ["http://downloads.example.net/downloads/report-2026-01.txt"], "files": [{"url": "http://downloads.example.net/downloads/report-2026-01.txt", "path": "full/a5aa714f9b354b2568693cb1cff6205f51b42e45.txt", "checksum": "e131101e2aee6259fdce55e6ec878ca2", "status": "downloaded"}]}
The path value is relative to FILES_STORE, not an absolute filesystem path.
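To resolve that value to an absolute location, join it with FILES_STORE. A small sketch, assuming the files-store directory used in this project:
from pathlib import Path

# FILES_STORE from settings.py plus the "path" value from the files metadata.
stored_file = Path("files-store") / "full/a5aa714f9b354b2568693cb1cff6205f51b42e45.txt"
print(stored_file.resolve())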
- List the file store to confirm the downloaded attachment exists on disk.
$ ls files-store/full
a5aa714f9b354b2568693cb1cff6205f51b42e45.txt
FilesPipeline hashes the source URL into the stored filename by default. Override file_path() in a custom pipeline when the original filename or a custom directory layout matters more than the default hash-based layout.
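A minimal sketch of such an override, keeping the last URL segment as the filename; CustomFilesPipeline is a hypothetical name and has to replace scrapy.pipelines.files.FilesPipeline in ITEM_PIPELINES, for example as downloads_demo.pipelines.CustomFilesPipeline:
import os
from urllib.parse import urlparse

from scrapy.pipelines.files import FilesPipeline


class CustomFilesPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        # Keep the original basename, e.g. full/report-2026-01.txt, instead of the URL hash.
        filename = os.path.basename(urlparse(request.url).path)
        return f"full/{filename}"
Identical basenames from different URLs overwrite each other under this layout, which is why the default uses a hash of the source URL.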
Mohd Shakir Zakaria is a cloud architect with deep roots in software development and open-source advocacy. Certified in AWS, Red Hat, VMware, ITIL, and Linux, he specializes in designing and managing robust cloud and on-premises infrastructures.
