Scrapy can keep a crawl queue on disk so a long run does not need to restart from the beginning after a planned stop. That is useful when a spider needs to pause for a deployment, a maintenance window, or a temporary target-side block.
The JOBDIR setting stores the pending request queue, the duplicate-request filter, and any persisted spider.state data for one spider run. A command-line override such as -s JOBDIR=jobstate/catalog-1 enables that storage for a single crawl without editing settings.py.
Each job directory belongs to one job: it may span several pause/resume runs of the same spider, but it must not be shared, and it should stay on persistent storage that untrusted users cannot write to. Resume works only after a clean shutdown; queued requests can fail later if login cookies expire, and requests that cannot be serialized with pickle will not survive a pause unless the spider is adjusted.
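As a minimal sketch of what survives a pause (the spider below is illustrative, not the catalog project used later), a spider can keep resumable counters in self.state, a dict that JOBDIR pickles on shutdown and restores on resume, while its requests stay serializable as long as callbacks are named spider methods rather than lambdas:

import scrapy

class StatefulSpider(scrapy.Spider):
    name = "stateful"
    start_urls = ["https://example.com/catalog"]

    def parse(self, response):
        # self.state is a plain dict persisted across pause/resume;
        # it is only available when a job directory is configured.
        self.state["pages_seen"] = self.state.get("pages_seen", 0) + 1
        for href in response.css("a::attr(href)").getall():
            # A named method keeps the request pickle-serializable; a
            # lambda callback could not be pickled and the request would
            # not survive a pause.
            yield response.follow(href, callback=self.parse_item)

    def parse_item(self, response):
        yield {"url": response.url, "pages_seen": self.state.get("pages_seen")}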
Steps to enable JOBDIR job persistence in Scrapy:
- Change to the Scrapy project root before starting the crawl.
$ cd /srv/catalog_demo
Run the command from the directory that contains scrapy.cfg so scrapy crawl loads the intended project and the relative JOBDIR path lands in the expected place.
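A quick check before launching (assuming the standard project layout) is to confirm the marker file is present in the current directory:
$ ls scrapy.cfg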
- Start the spider with a dedicated JOBDIR path.
$ scrapy crawl catalog -s JOBDIR=jobstate/catalog-1
2026-04-22 06:58:18 [scrapy.crawler] INFO: Overridden settings:
{'JOBDIR': 'jobstate/catalog-1',
##### snipped #####
}
2026-04-22 06:58:20 [scrapy.core.engine] INFO: Spider opened
Command-line settings have the highest precedence in Scrapy, so -s JOBDIR=… overrides project or spider defaults for this run only. Use an absolute path when the crawl runs under a service manager, scheduler, or container, where the working directory may vary.
Do not share one JOBDIR path between different spiders or between concurrent runs of the same spider.
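One way to make that separation the default (the absolute path and the RUN_ID variable here are assumptions, not part of the project above) is a spider-specific JOBDIR in custom_settings; a -s JOBDIR=… flag still wins because command-line settings carry the highest precedence:

import os
import scrapy

class CatalogSpider(scrapy.Spider):
    name = "catalog"
    # Hypothetical scheme: an absolute base path plus a per-run id from
    # the environment; reuse the same RUN_ID when resuming a paused job.
    custom_settings = {
        "JOBDIR": f"/srv/catalog_demo/jobstate/catalog-{os.environ.get('RUN_ID', '1')}",
    }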
- Stop the crawl cleanly with a single Ctrl+C when the pending queue should stay on disk for a later run.
^C
2026-04-22 06:58:23 [scrapy.crawler] INFO: Received SIGINT, shutting down gracefully. Send again to force
2026-04-22 06:58:23 [scrapy.core.engine] INFO: Closing spider (shutdown)
2026-04-22 06:58:24 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'finish_reason': 'shutdown',
'item_scraped_count': 4,
'scheduler/dequeued/disk': 5,
'scheduler/enqueued/disk': 21,
##### snipped #####
}
2026-04-22 06:58:24 [scrapy.core.engine] INFO: Spider closed (shutdown)
Forced termination can corrupt the saved queue and leave the next resume incomplete or unusable.
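When the crawl runs in the background or under a service manager and Ctrl+C is unavailable, a single SIGINT triggers the same graceful shutdown (the pgrep pattern is an assumption about how the process was started); a second SIGINT forces the unclean stop warned about above:
$ kill -INT $(pgrep -f 'scrapy crawl catalog')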
- Run the same crawl command again with the same JOBDIR path to resume the saved queue.
$ scrapy crawl catalog -s JOBDIR=jobstate/catalog-1
2026-04-22 06:58:41 [scrapy.core.engine] INFO: Spider opened
2026-04-22 06:58:41 [scrapy.core.scheduler] INFO: Resuming crawl (16 requests scheduled)
2026-04-22 06:58:51 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
'item_scraped_count': 16,
'scheduler/dequeued/disk': 17,
'scheduler/enqueued/disk': 1,
##### snipped #####
}
2026-04-22 06:58:51 [scrapy.core.engine] INFO: Spider closed (finished)
If some pending requests do not reappear after the resume, enable SCHEDULER_DEBUG = True so Scrapy logs requests that could not be serialized into the job directory.
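Enabling it for one resumed run works the same way as the JOBDIR override itself:
$ scrapy crawl catalog -s JOBDIR=jobstate/catalog-1 -s SCHEDULER_DEBUG=True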
- Remove the saved job directory before starting a fresh crawl that should ignore the earlier queue and duplicate filter.
$ rm -rf jobstate/catalog-1
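Running the earlier crawl command again after the removal starts from scratch; Scrapy recreates the job directory and an empty queue on startup:
$ scrapy crawl catalog -s JOBDIR=jobstate/catalog-1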
