Many datasets are published as downloadable attachments—PDF reports, spreadsheets, images—linked from an index page or listing. Downloading those files during a Scrapy crawl keeps each scraped item complete and avoids gaps caused by missing attachments.
Scrapy downloads files through the built-in FilesPipeline, which reads a list of URLs from an item field named file_urls by default. Each successful download is saved under the FILES_STORE location and a corresponding entry is added to the item’s files field, including the stored path and a content checksum.
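For orientation, the sketch below shows the general shape of an item before and after the pipeline runs; the URL, title, hash, and checksum values are illustrative placeholders, not real output.

# Shape of an item before FilesPipeline runs: only file_urls is required.
scraped_item = {
    "title": "Quarterly Report",
    "file_urls": ["https://example.org/report.pdf"],
}

# Shape after FilesPipeline runs: a files entry is added for each download,
# with the storage path (relative to FILES_STORE) and a content checksum.
processed_item = {
    "title": "Quarterly Report",
    "file_urls": ["https://example.org/report.pdf"],
    "files": [
        {
            "url": "https://example.org/report.pdf",
            "path": "full/<hash-of-url>.pdf",
            "checksum": "<checksum-of-file-body>",
            "status": "downloaded",
        }
    ],
}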
File downloads increase bandwidth and disk I/O, so high concurrency can bottleneck a crawl or fill a filesystem unexpectedly. Some sites gate downloads behind cookies, short-lived signed URLs, or anti-bot checks, so file requests may need the same session context as the page request. Keep FILES_STORE outside the project directory (or exclude it from version control) to avoid committing large binaries.
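If bandwidth or server load is a concern, the crawl can be throttled through Scrapy's standard concurrency settings in settings.py; the values below are a sketch with illustrative starting points, not recommendations.

# settings.py (sketch): throttle the crawl so file downloads do not
# overwhelm the target site or local disk; tune values per site.
CONCURRENT_REQUESTS = 8             # global cap on in-flight requests
CONCURRENT_REQUESTS_PER_DOMAIN = 4  # per-domain cap; file requests use the same downloader
DOWNLOAD_DELAY = 0.5                # seconds to wait between requests to a domain
DOWNLOAD_TIMEOUT = 180              # raise this if large files time out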
Related: How to enable item pipelines in Scrapy
Related: How to set a download delay in Scrapy
Steps to download files with Scrapy:
- Open the spider file that extracts the download URL.
$ vi downloads_demo/spiders/reports.py
- Yield items that include a file_urls list of absolute file URLs.
import scrapy


class ReportsSpider(scrapy.Spider):
    name = "reports"
    start_urls = ["http://downloads.example.net:8000/downloads/"]

    def parse(self, response):
        for report in response.css("a"):
            href = report.css("::attr(href)").get()
            if href:
                yield {
                    "title": report.css("::text").get(),
                    "file_urls": [response.urljoin(href)],
                }
FilesPipeline expects a list in file_urls even for a single file URL.
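If the spider yields Item objects instead of plain dicts, the item class must declare both fields the pipeline uses; a minimal sketch with a hypothetical ReportItem class:

import scrapy


class ReportItem(scrapy.Item):
    title = scrapy.Field()
    file_urls = scrapy.Field()  # input: URLs for FilesPipeline to fetch
    files = scrapy.Field()      # output: populated by FilesPipeline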
- Configure FilesPipeline storage settings in settings.py.
ITEM_PIPELINES = {
    "scrapy.pipelines.files.FilesPipeline": 1,
}

FILES_STORE = "/root/sg-work/files-store"
Large crawls can fill the filesystem under FILES_STORE; choose a path on a volume with enough free space.
On Windows 11, set FILES_STORE to a path like C:\scrapy\files.
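Two related storage settings are worth knowing about, sketched below: FILES_EXPIRES controls how long a previously downloaded file is considered fresh before it is re-downloaded, and FILES_STORE also accepts remote URIs such as an S3 bucket when the required client library is installed.

# settings.py (sketch): optional FilesPipeline storage tuning.
FILES_EXPIRES = 90                        # days before a stored file is re-downloaded
# FILES_STORE = "s3://my-bucket/files/"   # remote storage alternative (requires botocore)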
- Run the spider with feed export enabled to save the files metadata.
$ scrapy crawl reports -O reports.jl
2026-01-01 09:48:50 [scrapy.pipelines.files] DEBUG: File (downloaded): Downloaded file from <GET http://downloads.example.net:8000/downloads/report-2026-01.txt> referred in <None>
2026-01-01 09:48:50 [scrapy.extensions.feedexport] INFO: Stored jl feed (1 items) in: reports.jl
- Inspect the exported items to confirm a files entry was recorded for each download.
$ head -n 1 reports.jl
{"title": "Quarterly Report", "file_urls": ["http://downloads.example.net:8000/downloads/report-2026-01.txt"], "files": [{"url": "http://downloads.example.net:8000/downloads/report-2026-01.txt", "path": "full/89b226624cca0340bd900a6bb8c3f6080f4eebe5.txt", "checksum": "4caf9e6249dc2f4d5eae15e4b53f656c", "status": "downloaded"}]}
- List the FILES_STORE directory to confirm the downloaded file exists on disk.
$ ls -1 /root/sg-work/files-store/full | head -n 2
89b226624cca0340bd900a6bb8c3f6080f4eebe5.txt
##### snipped #####
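As a final check, a short script can cross-reference the exported feed against FILES_STORE; this is a sketch that assumes the file and store paths used in the steps above.

# verify_downloads.py (sketch): confirm every file recorded in the feed
# actually exists on disk under FILES_STORE.
import json
import os

FILES_STORE = "/root/sg-work/files-store"

with open("reports.jl") as feed:
    for line in feed:
        item = json.loads(line)
        for entry in item.get("files", []):
            stored = os.path.join(FILES_STORE, entry["path"])
            print(entry["status"], entry["path"], "exists:", os.path.exists(stored))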
The stored filename is a hash by default; override file_path() in a custom pipeline to preserve original filenames.
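A minimal sketch of such a pipeline is shown below; the NamedFilesPipeline class name and module path are illustrative, and note that two URLs ending in the same filename would overwrite each other with this scheme.

# downloads_demo/pipelines.py (sketch): keep the original filename from the
# download URL instead of the default hash-based name.
from pathlib import PurePosixPath
from urllib.parse import urlparse

from scrapy.pipelines.files import FilesPipeline


class NamedFilesPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        # Use the last path segment of the URL, e.g. report-2026-01.txt.
        return "full/" + PurePosixPath(urlparse(request.url).path).name

Register the subclass in ITEM_PIPELINES (for example "downloads_demo.pipelines.NamedFilesPipeline": 1) in place of the stock FilesPipeline so it handles the downloads.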
