How to ignore robots.txt for Scrapy spiders

Ignoring robots.txt in Scrapy is useful when a crawl runs against content that is explicitly authorized for testing, QA, or internal extraction but is still disallowed for public bots. Turning the robots check off lets the spider request those pages instead of stopping at the site's published crawl policy.

Current Scrapy releases generate settings.py with ROBOTSTXT_OBEY = True, so RobotsTxtMiddleware fetches /robots.txt and drops disallowed requests before the spider callback ever sees a response. Setting ROBOTSTXT_OBEY to False disables that middleware for the project and lets the crawler send those requests normally.
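
The policy itself is just plain text served by the site. An illustrative robots.txt that matches the demo target used in step 6 looks like the excerpt below; the real rules depend on the site being crawled:

    http://127.0.0.1:8000/robots.txt
    User-agent: *
    Disallow: /private/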

Disabling the project setting affects every spider that uses that settings module, so it should only be used for targets with explicit permission and conservative crawl limits. For a one-run exception instead of a permanent project change, pass -s ROBOTSTXT_OBEY=False on the crawl command and leave the project default in place.
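
When only one spider needs the exception, the per-spider custom_settings class attribute scopes the override to that spider and leaves the rest of the project obeying robots.txt. A minimal sketch, assuming the example spider and the local test target used in step 6:

    tutorial/spiders/example.py
    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = "example"
        # Override the project setting for this spider only; other spiders
        # keep the ROBOTSTXT_OBEY value from settings.py.
        custom_settings = {"ROBOTSTXT_OBEY": False}
        start_urls = ["http://127.0.0.1:8000/"]

        def parse(self, response):
            # Yield the same url/status item shown in the step 6 output.
            yield {"url": response.url, "status": response.status}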

Steps to ignore robots.txt for Scrapy spiders:

  1. Change to the Scrapy project root that contains scrapy.cfg.
    $ cd /srv/tutorial

    Project-aware commands such as scrapy settings and scrapy crawl read the active settings module from this directory.
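
    The generated file is short; its [settings] section is what points project-aware commands at the tutorial.settings module (typical scrapy startproject output):
    scrapy.cfg
    [settings]
    default = tutorial.settings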

  2. Open the project settings file.
    $ vi tutorial/settings.py
  3. Locate the generated robots policy in settings.py.
    tutorial/settings.py
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = True
  4. Set ROBOTSTXT_OBEY to False and save the file.
    tutorial/settings.py
    ROBOTSTXT_OBEY = False

    If the line is missing, add it once in the project settings module instead of repeating the same override across several spiders. Related: How to override Scrapy settings from the command line

    Leaving this project-level setting disabled makes every spider in the project ignore the target's published crawl rules until the value is changed back.
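
    If a few spiders should keep honoring robots.txt while the project default stays False, the same custom_settings mechanism works in the other direction (hypothetical spider name and path):
    tutorial/spiders/polite.py
    import scrapy

    class PoliteSpider(scrapy.Spider):
        name = "polite"
        # Restore the robots.txt check for this spider only, even though
        # the project default is now ROBOTSTXT_OBEY = False.
        custom_settings = {"ROBOTSTXT_OBEY": True}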

  5. Read the effective setting value that the project now loads.
    $ scrapy settings --get ROBOTSTXT_OBEY
    False

    For a single crawl instead of a project-wide change, run scrapy crawl example -s ROBOTSTXT_OBEY=False.
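
    The same single-run override also works from a script through CrawlerProcess; a sketch, assuming it is executed from the project root so get_project_settings() can locate tutorial.settings:
    run_crawl.py  (hypothetical helper)
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    # Load the project settings, then override robots handling for this
    # run only; settings.py on disk stays untouched.
    settings = get_project_settings()
    settings.set("ROBOTSTXT_OBEY", False)

    process = CrawlerProcess(settings)
    process.crawl("example")  # spider name registered in the project
    process.start()           # blocks until the crawl finishes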

  6. Run the spider with debug logging and confirm that the previously disallowed page is fetched instead of blocked by robots.txt.
    $ scrapy crawl example -s LOG_LEVEL=DEBUG
    2026-04-22 07:11:09 [scrapy.middleware] INFO: Enabled downloader middlewares:
    ['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
     'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
    ##### snipped #####
     'scrapy.downloadermiddlewares.stats.DownloaderStats']
    2026-04-22 07:11:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1:8000/> (referer: None)
    2026-04-22 07:11:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1:8000/private/> (referer: http://127.0.0.1:8000/)
    2026-04-22 07:11:10 [scrapy.core.scraper] DEBUG: Scraped from <200 http://127.0.0.1:8000/private/>
    {'url': 'http://127.0.0.1:8000/private/', 'status': 200}

    When robots.txt handling is still active, the startup log lists scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware among the enabled downloader middlewares, Scrapy fetches /robots.txt before other pages, and each disallowed URL is dropped with a Forbidden by robots.txt message instead of returning the page.
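
    For comparison, a blocked run produces log lines like these (timestamps and URLs illustrative):
    $ scrapy crawl example -s ROBOTSTXT_OBEY=True -s LOG_LEVEL=DEBUG
    2026-04-22 07:14:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1:8000/robots.txt> (referer: None)
    2026-04-22 07:14:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1:8000/> (referer: None)
    2026-04-22 07:14:03 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET http://127.0.0.1:8000/private/>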