Many datasets are published as downloadable attachments—PDF reports, spreadsheets, images—linked from an index page or listing. Downloading those files during a Scrapy crawl keeps each scraped item complete and avoids gaps caused by missing attachments.

Scrapy downloads files through the built-in FilesPipeline, which reads a list of URLs from an item field named file_urls by default. Each successful download is saved under the FILES_STORE location and a corresponding entry is added to the item’s files field, including the stored path and a content checksum.
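
If an item already uses different field names, the pipeline can be pointed at them instead of file_urls and files. A minimal sketch, assuming hypothetical field names report_urls and report_files:

    # settings.py: read URLs from, and write results to, custom item fields
    FILES_URLS_FIELD = "report_urls"
    FILES_RESULT_FIELD = "report_files"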

File downloads increase bandwidth and disk I/O, so high concurrency can bottleneck a crawl or fill a filesystem unexpectedly. Some sites gate downloads behind cookies, short-lived signed URLs, or anti-bot checks, so file requests may need the same session context as the page request. Keep FILES_STORE outside the project directory (or exclude it from version control) to avoid committing large binaries.
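
One way to carry that session context into the file requests is to subclass FilesPipeline and build the requests yourself. The sketch below is an assumption, not part of the demo project: it presumes the spider stores the originating page URL in a hypothetical referer item field and forwards it as a Referer header; cookies set during the crawl are applied by Scrapy's cookie middleware as usual.

    import scrapy
    from itemadapter import ItemAdapter
    from scrapy.pipelines.files import FilesPipeline

    class SessionFilesPipeline(FilesPipeline):
        """Sketch: request each file with the context of the page that linked it."""

        def get_media_requests(self, item, info):
            adapter = ItemAdapter(item)
            referer = adapter.get("referer")  # hypothetical field set by the spider
            for url in adapter.get("file_urls", []):
                headers = {"Referer": referer} if referer else {}
                # The media pipeline handles the response itself; no callback is needed.
                yield scrapy.Request(url, headers=headers)

Registering this class in ITEM_PIPELINES in place of the stock FilesPipeline leaves the rest of the workflow below unchanged.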

Steps to download files with Scrapy:

  1. Open the spider file that extracts the download URL.
    $ vi downloads_demo/spiders/reports.py
  2. Yield items that include a file_urls list of absolute file URLs.
    import scrapy

    class ReportsSpider(scrapy.Spider):
        name = "reports"
        start_urls = ["http://downloads.example.net:8000/downloads/"]

        def parse(self, response):
            for report in response.css("a"):
                href = report.css("::attr(href)").get()
                if href:  # skip anchors without an href
                    yield {
                        "title": report.css("::text").get(),
                        # FilesPipeline reads this list; urljoin() makes the URL absolute.
                        "file_urls": [response.urljoin(href)],
                    }

    FilesPipeline expects a list in file_urls even for a single file URL.

  3. Configure FilesPipeline storage settings in settings.py.
    ITEM_PIPELINES = {
        "scrapy.pipelines.files.FilesPipeline": 1,
    }
     
    FILES_STORE = "/root/sg-work/files-store"

    Large crawls can fill the filesystem under FILES_STORE; choose a path on a volume with enough free space.

    On Windows 11, set FILES_STORE to a path like C:\scrapy\files.

  4. Run the spider with feed export enabled to save the files metadata.
    $ scrapy crawl reports -O reports.jl
    2026-01-01 09:48:50 [scrapy.pipelines.files] DEBUG: File (downloaded): Downloaded file from <GET http://downloads.example.net:8000/downloads/report-2026-01.txt> referred in <None>
    2026-01-01 09:48:50 [scrapy.extensions.feedexport] INFO: Stored jl feed (1 items) in: reports.jl
  5. Inspect the exported items to confirm a files entry was recorded for each download.
    $ head -n 1 reports.jl
    {"title": "Quarterly Report", "file_urls": ["http://downloads.example.net:8000/downloads/report-2026-01.txt"], "files": [{"url": "http://downloads.example.net:8000/downloads/report-2026-01.txt", "path": "full/89b226624cca0340bd900a6bb8c3f6080f4eebe5.txt", "checksum": "4caf9e6249dc2f4d5eae15e4b53f656c", "status": "downloaded"}]}
  6. List the FILES_STORE directory to confirm the downloaded file exists on disk.
    $ ls -1 /root/sg-work/files-store/full | head -n 2
    89b226624cca0340bd900a6bb8c3f6080f4eebe5.txt
    ##### snipped #####

    The stored filename is a hash by default; override file_path() in a custom pipeline to preserve original filenames.
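
    A minimal sketch of that override, assuming the last segment of the URL path is a safe, unique filename (colliding names would overwrite each other, which the default hash naming avoids):

    from pathlib import PurePosixPath
    from urllib.parse import urlparse

    from scrapy.pipelines.files import FilesPipeline

    class OriginalNameFilesPipeline(FilesPipeline):
        def file_path(self, request, response=None, info=None, *, item=None):
            # e.g. ".../downloads/report-2026-01.txt" -> "full/report-2026-01.txt"
            return "full/" + PurePosixPath(urlparse(request.url).path).name

    Point ITEM_PIPELINES at this class instead of the stock FilesPipeline to activate it.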