Downloading files during a Scrapy crawl keeps each scraped item tied to its original attachment instead of leaving PDFs, CSV exports, archives, or text reports behind on the remote site. That matters when the attached file is part of the dataset, audit trail, or offline processing pipeline the crawl is supposed to capture.

Scrapy handles this workflow through FilesPipeline. When a spider yields an item with a file_urls list, the pipeline schedules those URLs through the normal downloader, saves each file under FILES_STORE, and writes a files result list back into the item with the original URL, stored path, checksum, and download status. Current Scrapy releases default those field names to file_urls and files, and a downloaded file is considered fresh for 90 days, the FILES_EXPIRES default, before Scrapy will download it again.
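
As a quick sketch, that freshness window can be shortened in settings.py; the 7-day value below is an arbitrary example, not a recommended default:

    # settings.py: treat stored files as fresh for 7 days instead of the 90-day default
    FILES_EXPIRES = 7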

A file crawl can fail even when the page scrape succeeds: the storage path may be invalid, the extracted download link may be left relative, or the site may redirect attachments through a signed or CDN-backed URL. FilesPipeline treats redirected media requests as failures by default, so MEDIA_ALLOW_REDIRECTS = True is required when a valid attachment first responds with a 301 or 302. Keep FILES_STORE out of version control, because the pipeline writes the downloaded payloads there, not just item metadata.

Steps to download files with Scrapy FilesPipeline:

  1. Change to the Scrapy project directory that contains scrapy.cfg.
    $ cd /srv/scrapy/downloads_demo
  2. Open the project settings file.
    $ vi downloads_demo/settings.py
  3. Enable FilesPipeline and set a writable FILES_STORE path.
    ITEM_PIPELINES = {
        "scrapy.pipelines.files.FilesPipeline": 1,
    }
     
    FILES_STORE = "files-store"

    FilesPipeline reads download URLs from file_urls and writes the per-file results to files unless those field names are remapped with the FILES_URLS_FIELD and FILES_RESULT_FIELD settings.
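
    As a sketch, both names can be remapped in settings.py when items already use their own fields; report_urls and report_files below are illustrative names, not defaults:

    FILES_URLS_FIELD = "report_urls"      # pipeline reads download URLs from this field
    FILES_RESULT_FIELD = "report_files"   # pipeline writes per-file results to this field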

    If FILES_STORE is missing, unwritable, or points to the wrong filesystem, the crawl can finish without saving the attachments that the page depends on.
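
    A pre-flight check catches that failure mode before a long crawl. This standalone sketch assumes the files-store path configured above:

    import os

    store = "files-store"
    os.makedirs(store, exist_ok=True)   # create the store if it does not exist yet
    if not os.access(store, os.W_OK):
        raise SystemExit(f"FILES_STORE {store!r} is not writable")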

  4. Open the spider file that extracts the download link.
    $ vi downloads_demo/spiders/reports.py
  5. Yield each attachment as an item with a file_urls list of absolute file URLs.
    import scrapy
     
    class ReportsSpider(scrapy.Spider):
        name = "reports"
        start_urls = ["http://downloads.example.net:8000/"]
     
        def parse(self, response):
            for report in response.css("a"):
                href = report.css("::attr(href)").get()
                if href:
                    yield {
                        "title": report.css("::text").get(),
                        "file_urls": [response.urljoin(href)],
                    }

    response.urljoin() converts a relative attachment path such as /downloads/report-2026-01.txt into a full URL that Scrapy can request.
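
    The join can be checked interactively with scrapy shell; the session below assumes the example host from the spider:

    $ scrapy shell http://downloads.example.net:8000/
    >>> response.urljoin("/downloads/report-2026-01.txt")
    'http://downloads.example.net:8000/downloads/report-2026-01.txt'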

  6. Run the spider and export the items so the files metadata is saved with each record.
    $ scrapy crawl reports -O reports.jl
    2026-04-16 05:59:30 [scrapy.utils.log] INFO: Scrapy 2.15.0 started (bot: downloads_demo)
    ##### snipped #####
    2026-04-16 05:59:30 [scrapy.middleware] INFO: Enabled item pipelines:
    ['scrapy.pipelines.files.FilesPipeline']
    2026-04-16 05:59:33 [scrapy.pipelines.files] DEBUG: File (downloaded): Downloaded file from <GET http://downloads.example.net:8000/downloads/report-2026-01.txt> referred in <None>
    2026-04-16 05:59:33 [scrapy.extensions.feedexport] INFO: Stored jl feed (1 items) in: reports.jl

    If the file URL redirects to a CDN or signed attachment endpoint, add MEDIA_ALLOW_REDIRECTS = True before this run; otherwise the media request is treated as a failure.
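
    The setting is a one-line addition to settings.py:

    # settings.py: follow 301/302 redirects on media (file) requests
    MEDIA_ALLOW_REDIRECTS = True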

  7. Open the exported feed to confirm Scrapy wrote the stored file metadata back into the item.
    $ cat reports.jl
    {"title": "Quarterly Report", "file_urls": ["http://downloads.example.net:8000/downloads/report-2026-01.txt"], "files": [{"url": "http://downloads.example.net:8000/downloads/report-2026-01.txt", "path": "full/15c873d201d986143a26397436f3f98988eb70e5.txt", "checksum": "bdeb2b084547e620413131a43bdd0523", "status": "downloaded"}]}

    The path value is relative to FILES_STORE, not an absolute filesystem path.
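
    A short sketch, assuming the reports.jl feed and files-store paths used in this walkthrough, resolves each stored file to an absolute path:

    import json
    import os

    FILES_STORE = "files-store"

    with open("reports.jl") as feed:
        for line in feed:
            item = json.loads(line)
            for entry in item["files"]:
                # entry["path"] is relative to FILES_STORE
                print(os.path.join(os.path.abspath(FILES_STORE), entry["path"]))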

  8. List the file store to confirm the downloaded attachment exists on disk.
    $ ls files-store/full
    15c873d201d986143a26397436f3f98988eb70e5.txt

    FilesPipeline hashes the source URL into the stored filename by default. Override file_path() in a custom pipeline when original filenames or a custom directory layout matter.
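
    A minimal sketch of that override, assuming the filename in the URL path is safe to reuse (a real pipeline should guard against collisions and empty basenames); NamedFilesPipeline is a hypothetical name:

    # downloads_demo/pipelines.py
    import os
    from urllib.parse import urlparse

    from scrapy.pipelines.files import FilesPipeline

    class NamedFilesPipeline(FilesPipeline):
        def file_path(self, request, response=None, info=None, *, item=None):
            # keep the filename from the URL instead of the default SHA-1 hash
            return "named/" + os.path.basename(urlparse(request.url).path)

    Point ITEM_PIPELINES at downloads_demo.pipelines.NamedFilesPipeline instead of the stock class to activate it.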