HTTP caching in Scrapy avoids re-downloading the same page every time a spider is tested, which shortens iteration cycles and reduces unnecessary load on the target site. That is especially useful while selectors, pagination, and item parsing are still being refined against a stable response.

Scrapy enables caching through HttpCacheMiddleware, which stores each request/response pair under the project data directory and looks it up again by request fingerprint. The default filesystem backend writes a relative cache path beneath .scrapy, and the default DummyPolicy replays the stored response until it expires.

Expiration, cache policy, and per-request bypasses matter because cached HTML can hide live content changes, login state, and one-time tokens. A non-zero HTTPCACHE_EXPIRATION_SECS value is safer for normal development, while meta={"dont_cache": True} should be used on requests that must always hit the live site.
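
As a concrete illustration, the hypothetical helper below (not part of Scrapy; the URLs are placeholders) builds the keyword arguments a spider would pass to scrapy.Request, attaching the bypass meta only to pages that must stay live:

```python
def request_kwargs(url, live=False):
    """Hypothetical helper (not part of Scrapy): build the keyword
    arguments a spider would pass to scrapy.Request."""
    kwargs = {"url": url}
    if live:
        # Session-bound pages must bypass HttpCacheMiddleware.
        kwargs["meta"] = {"dont_cache": True}
    return kwargs

# A login request that should always hit the live site:
print(request_kwargs("https://example.com/login", live=True))
# {'url': 'https://example.com/login', 'meta': {'dont_cache': True}}
```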

Steps to enable HTTP cache in Scrapy:

  1. Open the project settings file from the Scrapy project root.
    $ vi demo/settings.py
  2. Add the HTTP cache settings to the project configuration.
    HTTPCACHE_ENABLED = True
    HTTPCACHE_DIR = "httpcache"
    HTTPCACHE_EXPIRATION_SECS = 600

    A relative cache directory is stored under .scrapy, so this setting writes the cache to .scrapy/httpcache inside the project.
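
    The directory resolution can be sketched in plain Python (a simplified model of Scrapy's behavior, not its actual code):

```python
from pathlib import Path

def resolve_cache_dir(project_root, httpcache_dir="httpcache"):
    # A relative HTTPCACHE_DIR is joined under the project data
    # directory (.scrapy); an absolute path is used as-is.
    p = Path(httpcache_dir)
    return str(p) if p.is_absolute() else str(Path(project_root) / ".scrapy" / p)

print(resolve_cache_dir("/home/user/demo"))  # /home/user/demo/.scrapy/httpcache
```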

  3. Run the spider once to download the live response and populate the cache.
    $ scrapy crawl catalog
    ##### snipped #####
    2026-04-16 06:19:49 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'httpcache/firsthand': 1,
     'httpcache/miss': 1,
     'httpcache/store': 1,
     'item_scraped_count': 2}
    ##### snipped #####

    httpcache/store confirms that the response was written into the cache backend during this crawl.

  4. Run the same spider again to confirm that Scrapy serves the stored response instead of downloading it again.
    $ scrapy crawl catalog
    ##### snipped #####
    2026-04-16 06:19:50 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'httpcache/hit': 1,
     'item_scraped_count': 2}
    ##### snipped #####

    httpcache/hit confirms that the middleware reused the stored response during the second crawl.
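
    The miss/store-then-hit accounting seen across the two crawls can be modeled with a small in-memory sketch (illustrative only, not Scrapy's middleware):

```python
from collections import Counter

def fetch(url, cache, stats, download=lambda u: f"<html>{u}</html>"):
    """Toy model of HttpCacheMiddleware accounting (not Scrapy code):
    a first crawl misses and stores; a repeat crawl hits the cache."""
    if url in cache:
        stats["httpcache/hit"] += 1
        return cache[url]
    stats["httpcache/miss"] += 1
    cache[url] = download(url)  # only a miss touches the network
    stats["httpcache/store"] += 1
    return cache[url]

cache, stats = {}, Counter()
fetch("https://example.com/catalog", cache, stats)  # first crawl
fetch("https://example.com/catalog", cache, stats)  # second crawl
print(dict(stats))
# {'httpcache/miss': 1, 'httpcache/store': 1, 'httpcache/hit': 1}
```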

  5. List the cache root to confirm that Scrapy created a per-spider cache tree on disk.
    $ ls -1 .scrapy/httpcache
    catalog

    Each spider gets its own directory below .scrapy/httpcache, and each cached request is stored under a fingerprinted subdirectory inside that spider tree.
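
    The on-disk layout can be approximated with a short sketch. Note that Scrapy's real fingerprint canonicalizes the URL (via w3lib) before hashing, so the hash below is illustrative rather than byte-for-byte identical:

```python
import hashlib

def cache_entry_dir(spider_name, method, url, body=b"", root=".scrapy/httpcache"):
    # Simplified sketch of request fingerprinting: SHA-1 over the
    # method, URL, and body (Scrapy canonicalizes the URL first).
    h = hashlib.sha1()
    for part in (method.encode(), url.encode(), body):
        h.update(part)
    key = h.hexdigest()
    # The filesystem backend shards entries by the first two hex chars.
    return f"{root}/{spider_name}/{key[:2]}/{key}"

print(cache_entry_dir("catalog", "GET", "https://example.com/catalog"))
```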

Notes

  • The default DummyPolicy ignores server cache headers and replays stored responses until they expire.
  • Switch to RFC2616Policy (HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy") when Cache-Control, ETag, and Last-Modified headers should control freshness and revalidation.
  • Use an expiration value of 0 only when cached responses should never expire, such as offline replay or parser work against a fixed page snapshot.
  • Set dont_cache to True on login requests, CSRF-protected forms, and other session-bound pages that should always be fetched live.
  • Remove .scrapy/httpcache to clear stored responses and force the next crawl to download fresh copies.
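
The DummyPolicy expiration rule from the notes above can be sketched as a pure-Python freshness check (a simplified model, not Scrapy's code):

```python
import time

def is_fresh(stored_at, expiration_secs, now=None):
    # Mirrors the DummyPolicy rule: 0 means cached responses never
    # expire; otherwise an entry is fresh for expiration_secs seconds.
    now = time.time() if now is None else now
    if expiration_secs == 0:
        return True
    return now - stored_at < expiration_secs

print(is_fresh(stored_at=1000, expiration_secs=600, now=1500))  # True
print(is_fresh(stored_at=1000, expiration_secs=600, now=1700))  # False
print(is_fresh(stored_at=1000, expiration_secs=0, now=10**9))   # True
```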