Ignoring robots.txt is useful when a Scrapy project needs to crawl approved staging paths, QA fixtures, or owned content that is intentionally blocked from public bots. Turning the check off lets a spider issue the requests and follow the links that parser, item pipeline, and export tests depend on.

Scrapy enforces crawl rules through RobotsTxtMiddleware when ROBOTSTXT_OBEY is enabled. Projects generated by scrapy startproject still set ROBOTSTXT_OBEY = True in settings.py, so the crawler fetches /robots.txt from each host and drops disallowed requests before a response ever reaches a spider callback.
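
For illustration, rules like the ones below, served from the target's /robots.txt, are what the middleware enforces; the /private/ path matches the local test server used later in this guide and is only an example.

    # robots.txt on the target host (example rules only)
    User-agent: *
    Disallow: /private/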

Disabling the setting affects every spider that uses the project settings module, so reserve it for targets you have explicit permission to crawl and pair it with conservative delay or concurrency settings. For a one-off crawl, pass -s ROBOTSTXT_OBEY=False on the command line instead of leaving the project default disabled after the test run.
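
A project-wide change can also be avoided by overriding the setting for a single spider through Scrapy's custom_settings attribute. The sketch below is illustrative only; the spider name is an assumption.

    # Illustrative sketch: overrides ROBOTSTXT_OBEY for this spider only.
    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = "example"

        # Spider-level settings take precedence over the project settings module,
        # so every other spider in the project keeps obeying robots.txt.
        custom_settings = {"ROBOTSTXT_OBEY": False}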

Steps to ignore robots.txt for Scrapy spiders:

  1. Open the Scrapy project settings file from the project root.
    $ vi tutorial/settings.py

    In a standard project layout, the settings module is usually <project_name>/settings.py.

  2. Locate the ROBOTSTXT_OBEY line in settings.py.
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = True

  3. Change ROBOTSTXT_OBEY to False and save the file.
    ROBOTSTXT_OBEY = False

    If the setting is missing, add it once in the project settings module instead of duplicating it across several spiders.

    Leaving the project default disabled makes every spider in the project ignore published crawl rules until the setting is reverted.
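
    After the edit, the robots-related part of settings.py might read as follows; the delay and concurrency values are illustrative placeholders for the conservative limits mentioned in the introduction, not required values.

    # tutorial/settings.py (excerpt)
    ROBOTSTXT_OBEY = False

    # Example conservative limits for crawling an owned or staging host
    DOWNLOAD_DELAY = 1.0
    CONCURRENT_REQUESTS_PER_DOMAIN = 2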

  4. Check the effective setting value that the current project loads.
    $ scrapy settings --get ROBOTSTXT_OBEY
    False

    For a single crawl instead of a project-wide change, run scrapy crawl example -s ROBOTSTXT_OBEY=False.
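
    The same check can be run from Python in the project directory; this is only an alternative to the scrapy settings command above, and the output shown assumes the edit from step 3 has been made.

    $ python -c "from scrapy.utils.project import get_project_settings; print(get_project_settings().getbool('ROBOTSTXT_OBEY'))"
    False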

  5. Run the spider with debug logging and confirm that RobotsTxtMiddleware no longer appears in the enabled downloader middlewares.
    $ scrapy crawl example -s LOG_LEVEL=DEBUG
    2026-04-16 06:16:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
    ['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
     'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
    ##### snipped #####
     'scrapy.downloadermiddlewares.stats.DownloaderStats']
    2026-04-16 06:16:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1:8000/> (referer: None)
    2026-04-16 06:16:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1:8000/private/> (referer: http://127.0.0.1:8000/)
    2026-04-16 06:16:20 [scrapy.core.scraper] DEBUG: Scraped from <200 http://127.0.0.1:8000/private/>
    {'url': 'http://127.0.0.1:8000/private/', 'status': 200}

    When robots.txt is still active, the log shows a request to /robots.txt and a Forbidden by robots.txt message instead of the private-page response.
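
    For reference, log output like the above could come from a spider along these lines; the link selector, item fields, and the local test server at 127.0.0.1:8000 are assumptions for illustration, not part of the project template.

    # tutorial/spiders/example.py - illustrative sketch
    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = "example"
        start_urls = ["http://127.0.0.1:8000/"]

        def parse(self, response):
            # Follow every on-page link, including paths that a published
            # robots.txt would normally put off limits.
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse_page)

        def parse_page(self, response):
            yield {"url": response.url, "status": response.status}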