Downloading files during a Scrapy crawl keeps each scraped item tied to the original attachment instead of leaving PDFs, CSV exports, archives, or text reports behind on a remote site. That matters when the attached file is part of the dataset, audit trail, or offline processing pipeline that the crawl is supposed to capture.
Scrapy handles this through FilesPipeline. When a spider yields an item with a file_urls list, the pipeline schedules those URLs through the normal downloader, stores each file under FILES_STORE, and writes a files result list back into the item with the original URL, stored path, checksum, and status. Current Scrapy releases still use file_urls and files as the default field names, and the default stored filename remains a URL-hash path under full/.
Relative download links must be converted with response.urljoin() before FilesPipeline can request them. If FILES_STORE is unset, the pipeline stays silently disabled even when listed in ITEM_PIPELINES, and an unwritable or wrong path only fails once files are stored; redirected attachment URLs also need MEDIA_ALLOW_REDIRECTS = True because the media pipelines treat redirects as failed downloads by default. Keep the file store out of version control: it contains the downloaded payloads, not just crawl metadata.
Related: How to enable item pipelines in Scrapy
Related: How to set a download delay in Scrapy
$ cd /srv/scrapy/downloads_demo
$ vi downloads_demo/settings.py
ITEM_PIPELINES = {
    "scrapy.pipelines.files.FilesPipeline": 1,
}

FILES_STORE = "files-store"
FilesPipeline reads download URLs from file_urls and writes the per-file results to files unless those field names are customized.
If FILES_STORE is empty, unwritable, or points to the wrong filesystem, the crawl can finish without saving the attachments that the item depends on.
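When items already carry their download URLs under other keys, the default field names can be remapped in settings.py instead of renaming the item fields. A minimal sketch; the "report_urls" and "report_files" names are illustrative, not part of the demo project:

```python
# Tell FilesPipeline to read URLs from "report_urls" and write
# per-file results to "report_files" instead of the defaults
# ("file_urls" and "files").
FILES_URLS_FIELD = "report_urls"
FILES_RESULT_FIELD = "report_files"
```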
$ vi downloads_demo/spiders/reports.py
import scrapy


class ReportsSpider(scrapy.Spider):
    name = "reports"
    start_urls = ["http://downloads.example.net/"]

    def parse(self, response):
        for report in response.css("a"):
            href = report.css("::attr(href)").get()
            if href:
                yield {
                    "title": report.css("::text").get(),
                    "file_urls": [response.urljoin(href)],
                }
response.urljoin() converts a relative attachment path such as /downloads/report-2026-01.txt into a full URL that Scrapy can request.
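response.urljoin(href) is equivalent to the standard library's urljoin(response.url, href), so the conversion can be checked outside a spider. A quick sketch using the page URL from this crawl:

```python
from urllib.parse import urljoin

# response.urljoin(href) resolves href against the page URL the same way.
page_url = "http://downloads.example.net/"
full_url = urljoin(page_url, "/downloads/report-2026-01.txt")
print(full_url)
# http://downloads.example.net/downloads/report-2026-01.txt
```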
$ scrapy crawl reports -O reports.jl
2026-04-22 07:25:08 [scrapy.utils.log] INFO: Scrapy 2.15.0 started (bot: downloads_demo)
##### snipped #####
2026-04-22 07:25:10 [scrapy.middleware] INFO: Enabled item pipelines:
['scrapy.pipelines.files.FilesPipeline']
2026-04-22 07:25:13 [scrapy.pipelines.files] DEBUG: File (downloaded): Downloaded file from <GET http://downloads.example.net/downloads/report-2026-01.txt> referred in <None>
2026-04-22 07:25:13 [scrapy.extensions.feedexport] INFO: Stored jl feed (1 items) in: reports.jl
If the file URL first returns 301 or 302, add MEDIA_ALLOW_REDIRECTS = True before this run or the media request is treated as failed.
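The redirect fix is a one-line addition to settings.py:

```python
# Let the media pipelines (FilesPipeline, ImagesPipeline) follow
# 301/302 responses instead of marking the download as failed.
MEDIA_ALLOW_REDIRECTS = True
```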
$ cat reports.jl
{"title": "Quarterly Report", "file_urls": ["http://downloads.example.net/downloads/report-2026-01.txt"], "files": [{"url": "http://downloads.example.net/downloads/report-2026-01.txt", "path": "full/a5aa714f9b354b2568693cb1cff6205f51b42e45.txt", "checksum": "e131101e2aee6259fdce55e6ec878ca2", "status": "downloaded"}]}
The path value is relative to FILES_STORE, not an absolute filesystem path.
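A downstream script therefore has to join FILES_STORE with that relative path before opening the file. A minimal sketch, using one feed line trimmed to the fields involved:

```python
import json
from pathlib import Path

FILES_STORE = "files-store"  # must match the value in settings.py

# One line from reports.jl, trimmed to the fields used here.
line = ('{"files": [{"path": '
        '"full/a5aa714f9b354b2568693cb1cff6205f51b42e45.txt"}]}')
item = json.loads(line)

# The stored "path" is relative to FILES_STORE, so join the two.
local_path = Path(FILES_STORE) / item["files"][0]["path"]
print(local_path)
```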
$ ls files-store/full
a5aa714f9b354b2568693cb1cff6205f51b42e45.txt
FilesPipeline hashes the source URL into the stored filename by default. Override file_path() in a custom pipeline when the original filename or a custom directory layout matters more than the default hash-based layout.
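The usual override keeps the last segment of the download URL as the filename. A sketch of that logic under stated assumptions; the helper is testable on its own, and the subclass shown in the comment (name and "reports/" prefix are illustrative) is where it would plug into a real pipeline:

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def original_filename(url: str) -> str:
    # Keep only the last path segment of the download URL.
    return PurePosixPath(urlparse(url).path).name


# In a real project this helper would be called from a FilesPipeline
# subclass registered in ITEM_PIPELINES, roughly:
#
#   from scrapy.pipelines.files import FilesPipeline
#
#   class NamedFilesPipeline(FilesPipeline):
#       def file_path(self, request, response=None, info=None, *, item=None):
#           return "reports/" + original_filename(request.url)

print(original_filename("http://downloads.example.net/downloads/report-2026-01.txt"))
# report-2026-01.txt
```

Beware that URL-derived filenames can collide when different URLs end in the same segment; the default hash-based layout avoids that, which is why it remains the default.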