How to enable JOBDIR job persistence in Scrapy

Scrapy can keep a crawl queue on disk so a long run does not need to restart from the beginning after a planned stop. That is useful when a spider needs to pause for a deployment, a maintenance window, or a temporary target-side block.

The JOBDIR setting names a directory that stores the pending request queue, the duplicate-request filter state, and any persisted spider.state data for one job. A command-line override such as -s JOBDIR=jobstate/catalog-1 enables that storage for a single crawl without editing settings.py.
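
The spider.state part is a plain dict on the spider: anything picklable placed in it is written to disk on a pause and restored on the next resume. A minimal sketch of a spider callback that keeps a running counter (the key name and yielded fields are illustrative):

    def parse(self, response):
        # self.state is pickled to <JOBDIR>/spider.state when the job is paused
        # and loaded back when the same JOBDIR is reused.
        self.state["pages_seen"] = self.state.get("pages_seen", 0) + 1
        yield {"url": response.url, "pages_seen": self.state["pages_seen"]}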

Each job directory belongs to a single job, that is, one pause-and-resume cycle of one spider, and should stay on persistent storage that untrusted users cannot write to. Resume works only after a clean shutdown, queued requests can fail later if login cookies expire, and requests that cannot be serialized with pickle will not survive a pause unless the spider is adjusted.
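
In practice the pickle caveat is mostly about how requests are built: callbacks must be named spider methods and any data carried along must itself be picklable. A hedged contrast from inside a spider callback, with url, category, and parse_item standing in for real names:

    # Lost on pause: a lambda callback cannot be serialized into the disk queue.
    yield scrapy.Request(url, callback=lambda r: self.parse_item(r, category))
    # Survives a pause: a named spider method plus picklable cb_kwargs.
    yield scrapy.Request(url, callback=self.parse_item, cb_kwargs={"category": category})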

Steps to enable JOBDIR job persistence in Scrapy:

  1. Change to the Scrapy project root before starting the crawl.
    $ cd /srv/catalog_demo

    Run the command from the directory that contains scrapy.cfg so scrapy crawl loads the intended project and the relative JOBDIR path lands in the expected place.
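
    A quick sanity check is scrapy list, which only works inside a project and prints the spider names it finds; the output below assumes the example catalog spider.
    $ scrapy list
    catalog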

  2. Start the spider with a dedicated JOBDIR path.
    $ scrapy crawl catalog -s JOBDIR=jobstate/catalog-1
    2026-04-22 06:58:18 [scrapy.crawler] INFO: Overridden settings:
    {'JOBDIR': 'jobstate/catalog-1',
    ##### snipped #####
    }
    2026-04-22 06:58:20 [scrapy.core.engine] INFO: Spider opened

    Command-line settings have the highest precedence in Scrapy, so -s JOBDIR=… overrides project or spider defaults for this run only. Use an absolute path when the crawl runs under a service manager, scheduler, or container, where the working directory may vary.

    Do not share one JOBDIR path between different spiders or between concurrent runs of the same spider.
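
    For crawls started by a service manager or scheduler, one option is to pin an absolute, spider-specific path in the spider's custom_settings; the path below is only an example, and a -s JOBDIR=… on the command line still takes precedence over it.
    import scrapy

    class CatalogSpider(scrapy.Spider):
        name = "catalog"
        # Example absolute location; keep one directory per spider and per job,
        # on persistent storage the crawl user can write to.
        custom_settings = {"JOBDIR": "/var/lib/scrapy/jobstate/catalog-1"}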

  3. Stop the crawl cleanly with a single Ctrl+C when the pending queue should stay on disk for a later run.
    ^C
    2026-04-22 06:58:23 [scrapy.crawler] INFO: Received SIGINT, shutting down gracefully. Send again to force
    2026-04-22 06:58:23 [scrapy.core.engine] INFO: Closing spider (shutdown)
    2026-04-22 06:58:24 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'finish_reason': 'shutdown',
     'item_scraped_count': 4,
     'scheduler/dequeued/disk': 5,
     'scheduler/enqueued/disk': 21,
    ##### snipped #####
    }
    2026-04-22 06:58:24 [scrapy.core.engine] INFO: Spider closed (shutdown)

    A second Ctrl+C, or a hard kill from a supervisor, forces immediate termination, which can corrupt the saved queue and leave the next resume incomplete or unusable.
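
    When the crawl runs under a service manager rather than a foreground terminal, the same graceful shutdown can be requested by sending SIGINT once to the Scrapy process; the match pattern below is illustrative.
    $ pkill -INT -f 'scrapy crawl catalog'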

  4. Run the same crawl command again with the same JOBDIR path to resume the saved queue.
    $ scrapy crawl catalog -s JOBDIR=jobstate/catalog-1
    2026-04-22 06:58:41 [scrapy.core.engine] INFO: Spider opened
    2026-04-22 06:58:41 [scrapy.core.scheduler] INFO: Resuming crawl (16 requests scheduled)
    2026-04-22 06:58:51 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'finish_reason': 'finished',
     'item_scraped_count': 16,
     'scheduler/dequeued/disk': 17,
     'scheduler/enqueued/disk': 1,
    ##### snipped #####
    }
    2026-04-22 06:58:51 [scrapy.core.engine] INFO: Spider closed (finished)

    If some pending requests do not reappear after the resume, set SCHEDULER_DEBUG to True so Scrapy logs requests that could not be serialized into the job directory. Related: How to use custom settings in Scrapy
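
    The flag can be switched on for one run the same way JOBDIR was set, for example:
    $ scrapy crawl catalog -s JOBDIR=jobstate/catalog-1 -s SCHEDULER_DEBUG=True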

  5. Remove the saved job directory before starting a fresh crawl that should ignore the earlier queue and duplicate filter.
    $ rm -rf jobstate/catalog-1
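
    Deleting the directory discards the on-disk request queue, the duplicate-filter fingerprints, and the pickled spider state. Before removal, a paused job directory typically looks like the listing below, though exact contents can vary by Scrapy version.
    $ ls jobstate/catalog-1
    requests.queue  requests.seen  spider.state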