Long-running Scrapy crawls often have to stop for deploys, rate limiting, or manual review, and restarting from scratch can waste bandwidth, re-fetch URLs that were already crawled, and delay the work still waiting in the queue.

The JOBDIR setting stores the crawl queue, duplicate-request filter, and any spider.state data on disk so the same spider can resume later. Because command-line settings have the highest precedence in Scrapy, a one-off run can enable persistence with -s JOBDIR=… without editing settings.py or baking the path into every spider.
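
As an illustration of the spider.state half of that, the sketch below counts pages across a pause and resume; the spider name, URLs, and selectors are placeholders rather than parts of a real project.

    import scrapy

    class CatalogSpider(scrapy.Spider):
        name = "catalog"
        start_urls = ["https://example.com/catalog"]

        def parse(self, response):
            # self.state exists only when the run has a JOBDIR; Scrapy
            # pickles this dict on pause and restores it on resume, so
            # everything stored in it must be picklable.
            self.state["pages_seen"] = self.state.get("pages_seen", 0) + 1
            for href in response.css("a.product::attr(href)").getall():
                yield response.follow(href, callback=self.parse_item)

        def parse_item(self, response):
            yield {"url": response.url, "title": response.css("h1::text").get()}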

Each job directory belongs to a single crawl job: one spider, paused and resumed, never shared between spiders or concurrent runs. It should stay on persistent storage that untrusted users cannot write to. Resume works only after a clean shutdown, queued requests can go stale if cookies expire or the spider logic changes, and requests that cannot be pickled will not survive a pause unless the spider is adjusted.
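
The pickling caveat usually surfaces through callbacks and meta values. A sketch of the difference, with hypothetical URLs:

    import scrapy

    class CallbackDemoSpider(scrapy.Spider):
        name = "callback_demo"
        start_urls = ["https://example.com/"]

        def parse(self, response):
            # Survives a pause: the callback is a spider method, which the
            # serializer records by name.
            yield scrapy.Request("https://example.com/a", callback=self.parse_item)
            # Does not survive a pause: a lambda has no name on the spider,
            # so the request falls back to the in-memory queue and is lost
            # at shutdown (step 4's SCHEDULER_DEBUG note logs these cases).
            yield scrapy.Request("https://example.com/b", callback=lambda r: None)

        def parse_item(self, response):
            yield {"url": response.url}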

Steps to enable JOBDIR job persistence in Scrapy:

  1. Change to the Scrapy project root so the spider name and relative JOBDIR path resolve against the intended project.
    $ cd catalog_demo
  2. Start the spider with a unique job directory.
    $ scrapy crawl catalog -s JOBDIR=jobstate/catalog-1
    2026-04-16 05:48:21 [scrapy.crawler] INFO: Overridden settings:
    {'JOBDIR': 'jobstate/catalog-1',
    ##### snipped #####
    }
    2026-04-16 05:48:21 [scrapy.core.engine] INFO: Spider opened

    Because -s settings take effect for this run only, the project and spider defaults stay untouched. Use an absolute path when the crawl runs under a service manager or in a container, where the working directory may vary.
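
    For example, with a hypothetical absolute location:

    $ scrapy crawl catalog -s JOBDIR=/var/lib/scrapy/jobstate/catalog-1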

    Do not reuse one JOBDIR path for a different spider or for concurrent runs of the same spider.
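
    One workable convention (an illustration, not something Scrapy enforces) is to key each directory by spider name plus a job identifier, reusing the exact path only to resume that job:

    $ scrapy crawl catalog -s JOBDIR=jobstate/catalog-2026-04-16
    $ scrapy crawl reviews -s JOBDIR=jobstate/reviews-2026-04-16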

  3. Stop the crawl cleanly with a single Ctrl+C to pause the remaining queue for later.
    ^C
    2026-04-16 05:48:33 [scrapy.crawler] INFO: Received SIGINT, shutting down gracefully. Send again to force
    2026-04-16 05:48:33 [scrapy.core.engine] INFO: Closing spider (shutdown)
    2026-04-16 05:48:35 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'finish_reason': 'shutdown',
     'item_scraped_count': 5,
     'scheduler/enqueued/disk': 25,
    ##### snipped #####
    }
    2026-04-16 05:48:35 [scrapy.core.engine] INFO: Spider closed (shutdown)

    Forced termination can corrupt the on-disk queue and prevent a reliable resume.
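
    When the crawl runs in the background, a single SIGINT triggers the same graceful shutdown as one Ctrl+C; the pgrep pattern below assumes the crawl was started exactly as in step 2:

    $ kill -INT "$(pgrep -f 'scrapy crawl catalog')"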

  4. Run the same crawl command again with the same job directory to resume the saved requests.
    $ scrapy crawl catalog -s JOBDIR=jobstate/catalog-1
    2026-04-16 05:48:43 [scrapy.core.scheduler] INFO: Resuming crawl (19 requests scheduled)
    2026-04-16 05:49:27 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'finish_reason': 'finished',
     'item_scraped_count': 19,
     'scheduler/dequeued/disk': 20,
    ##### snipped #####
    }
    2026-04-16 05:49:27 [scrapy.core.engine] INFO: Spider closed (finished)

    If the resumed crawl seems to skip pending work, set SCHEDULER_DEBUG to True to log requests that could not be serialized into the job directory.
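
    The setting can also be flipped on for a single run from the command line:

    $ scrapy crawl catalog -s JOBDIR=jobstate/catalog-1 -s SCHEDULER_DEBUG=1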

  5. Remove the saved job directory before starting a completely fresh crawl that should ignore the earlier queue and duplicate filter.
    $ rm -rf jobstate/catalog-1
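
    Alternatively, keep the old state around for auditing and point the fresh run at a new, unique directory instead:

    $ scrapy crawl catalog -s JOBDIR=jobstate/catalog-2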