Long-running Scrapy crawls can be interrupted by deploy restarts, network failures, or rate limiting, and restarting from scratch can waste hours of work and repeat requests against the same targets.

Setting JOBDIR tells Scrapy to persist its scheduler queues and duplicate-filter state to disk. When the same spider starts again with the same job directory, Scrapy reloads the saved state and continues with the pending requests instead of starting over.
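
JOBDIR can also be supplied per run on the command line instead of in the settings file, which makes it easy to point the same spider at different job directories; for example:

    $ scrapy crawl catalog -s JOBDIR=jobstate/catalog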

Resume state is coupled to the spider's request generation and settings, so changing start URLs, link-extraction logic, middlewares, or concurrency between runs can lead to duplicates, missed pages, or errors. Use a unique job directory per spider or target, keep it on persistent storage, and delete it only when starting a clean crawl that should not resume.
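
In a project with several spiders, each spider can pin its own job directory through custom_settings so no two spiders share state. The sketch below is illustrative only; the class name, start URL, and parsing logic are placeholders:

    # Illustrative spider module; class name, start URL, and parsing are placeholders.
    import scrapy


    class CatalogSpider(scrapy.Spider):
        name = "catalog"
        start_urls = ["https://www.example.com/catalog"]

        # Per-spider override so this spider does not share a job directory
        # with other spiders in the project.
        custom_settings = {"JOBDIR": "jobstate/catalog"}

        def parse(self, response):
            # Follow links; any requests still pending when the crawl is
            # paused are persisted under JOBDIR and resumed on the next run.
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse)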

Steps to enable job persistence with JOBDIR in Scrapy:

  1. Open the Scrapy project settings file.
    $ vi simplifiedguide/settings.py
  2. Set the JOBDIR path for persisted spider state.
    JOBDIR = "jobstate/catalog"

    A relative path is created under the directory from which the crawl is launched, while an absolute path such as /var/lib/scrapy/jobstate/catalog is safer for service-managed runs. Avoid sharing the same JOBDIR between concurrent spider processes.
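
    A sketch of the absolute-path form for a service-managed run; the base directory is an assumption and must exist and be writable by the user running the crawl:

    # settings.py -- absolute job directory for service-managed runs
    from pathlib import Path

    JOBDIR = str(Path("/var/lib/scrapy/jobstate") / "catalog")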

  3. Confirm the JOBDIR value loaded by Scrapy.
    $ scrapy settings --get JOBDIR
    jobstate/catalog
  4. Run the spider with the new JOBDIR setting.
    $ scrapy crawl catalog
    2026-01-01 08:32:05 [scrapy.middleware] INFO: Enabled extensions:
    ['scrapy.extensions.corestats.CoreStats',
     'scrapy.extensions.telnet.TelnetConsole',
     'scrapy.extensions.memusage.MemoryUsage',
     'scrapy.extensions.logstats.LogStats',
     'scrapy.extensions.spiderstate.SpiderState',
    ##### snipped #####]
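
    The scrapy.extensions.spiderstate.SpiderState entry above also saves the spider's state dict into the job directory, so simple counters or cursors survive between batches. A minimal sketch of a spider callback using it; the items_count key is illustrative:

    def parse_item(self, response):
        # self.state is a plain dict that the SpiderState extension stores in
        # JOBDIR when the crawl stops and reloads when it resumes.
        self.state["items_count"] = self.state.get("items_count", 0) + 1
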
  5. Stop the crawl with Ctrl+C to preserve the job state.

    Press Ctrl+C only once; a second Ctrl+C, like force-killing the process, aborts the graceful shutdown and can leave partially written scheduler state that makes resuming unreliable.
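
    If the crawl runs in the background or under a process supervisor, sending a single SIGINT (or SIGTERM) should trigger the same graceful shutdown as one Ctrl+C; the pgrep pattern below is illustrative:

    $ kill -INT $(pgrep -f 'scrapy crawl catalog')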

  6. Run the spider again to resume from the saved state.
    $ scrapy crawl catalog
    2026-01-01 08:33:26 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'scheduler/dequeued/disk': 21,
     'scheduler/enqueued/disk': 21,
    ##### snipped #####
    }
  7. Remove the JOBDIR directory to start a fresh crawl without resuming.
    $ rm -rf jobstate/catalog

    Deleting the job directory discards the pending request queue and the duplicate-filter history, so the interrupted crawl can no longer be resumed.