Ignoring robots.txt can be useful when validating a Scrapy spider against staging environments, QA targets, or owned sites where crawl rules are intentionally restrictive. Disabling the rules keeps focus on parsing, request scheduling, and data extraction without being blocked by disallow patterns meant for public bots.

The robots.txt file lives at the site root and contains User-agent and Disallow directives that polite crawlers use to decide which paths to avoid. When ROBOTSTXT_OBEY is enabled, Scrapy fetches /robots.txt for each domain and uses RobotsTxtMiddleware to filter requests that match the rules for the configured user agent.
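
The allow/deny decision the middleware makes can be illustrated with Python's standard urllib.robotparser module. This is only a sketch of the matching logic, not what Scrapy runs internally, and it reuses the staging host and Disallow rule shown in the steps below.

    # Illustrative only: evaluate the robots.txt rules from step 1 by hand.
    from urllib.robotparser import RobotFileParser

    parser = RobotFileParser()
    parser.parse([
        "User-agent: *",
        "Disallow: /private/",
    ])

    # RobotsTxtMiddleware makes this kind of per-request allow/deny decision.
    print(parser.can_fetch("simplifiedguide", "http://app.internal.example:8000/private/report"))  # False
    print(parser.can_fetch("simplifiedguide", "http://app.internal.example:8000/index.html"))      # True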

Disabling ROBOTSTXT_OBEY prevents those rules from being applied, either temporarily for a single run via command-line settings or permanently via the project's settings.py file. Skipping crawl rules can breach acceptable use policies and may trigger blocking or rate limiting, so only do this against targets you have explicit permission to crawl, and keep request rates conservative.
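
The same per-run override also works when a crawl is launched from a script instead of the scrapy command. The snippet below is a minimal sketch that assumes the simplifiedguide project and its simplified spider used in the steps that follow, and that it is run from inside the project directory so the project settings can be found.

    # Sketch: one crawl with robots.txt checks disabled, without editing settings.py.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    settings = get_project_settings()
    settings.set("ROBOTSTXT_OBEY", False, priority="cmdline")  # same effect as --set

    process = CrawlerProcess(settings)
    process.crawl("simplified")  # spider name registered via SPIDER_MODULES
    process.start()              # blocks until the crawl finishes

Because cmdline is the highest settings priority, the override applies only to that run and leaves settings.py untouched.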

Steps to ignore robots.txt for Scrapy spiders:

  1. Open the robots.txt file for the target site to review its crawl rules (optional).
    $ curl http://app.internal.example:8000/robots.txt
    User-agent: *
    Disallow: /private/
  2. Run the spider with the current project settings to see whether robots.txt is being consulted.
    $ scrapy crawl simplified -s LOG_LEVEL=DEBUG
    2026-01-01 08:29:20 [scrapy.utils.log] INFO: Scrapy 2.11.1 started (bot: simplifiedguide)
    ##### snipped #####
    {'BOT_NAME': 'simplifiedguide',
     'NEWSPIDER_MODULE': 'simplifiedguide.spiders',
     'ROBOTSTXT_OBEY': True,
     'SPIDER_MODULES': ['simplifiedguide.spiders']}
    ##### snipped #####
    2026-01-01 08:29:20 [scrapy.middleware] INFO: Enabled downloader middlewares:
    ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
    ##### snipped #####
    2026-01-01 08:29:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://app.internal.example:8000/robots.txt> (referer: None)
  3. Run the spider with ROBOTSTXT_OBEY disabled for a single crawl using the --set option.
    $ scrapy crawl simplified --set=ROBOTSTXT_OBEY=False
    2026-01-01 08:29:30 [scrapy.utils.log] INFO: Scrapy 2.11.1 started (bot: simplifiedguide)
    ##### snipped #####
    2026-01-01 08:29:30 [scrapy.crawler] INFO: Overridden settings:
    {'BOT_NAME': 'simplifiedguide',
     'NEWSPIDER_MODULE': 'simplifiedguide.spiders',
     'ROBOTSTXT_OBEY': 'False',
     'SPIDER_MODULES': ['simplifiedguide.spiders']}

    RobotsTxtMiddleware is not enabled when ROBOTSTXT_OBEY is disabled (it raises NotConfigured at startup), so robots.txt is neither fetched nor checked for that run.

    Ignoring robots.txt without explicit permission can violate site terms and may trigger blocks such as HTTP 403 responses, CAPTCHAs, or IP bans.

  4. Open the Scrapy project's settings.py file in an editor.
    $ vi simplifiedguide/settings.py
  5. Locate the ROBOTSTXT_OBEY setting.
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = True
  6. Set ROBOTSTXT_OBEY to False in the project settings.
    ROBOTSTXT_OBEY = False

    This change applies to all spiders in the project unless a spider overrides it via custom_settings, as sketched after these steps.

  7. Confirm the project-level ROBOTSTXT_OBEY value is now disabled.
    $ scrapy settings --get ROBOTSTXT_OBEY
    False
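
To bypass robots.txt for a single spider while the rest of the project keeps obeying it, leave the project-wide setting at True and override it per spider instead. The following is a minimal sketch that assumes the simplified spider and the staging host used in the steps above.

    # Sketch: per-spider override via custom_settings; other spiders keep the
    # project-wide ROBOTSTXT_OBEY value from settings.py.
    import scrapy


    class SimplifiedSpider(scrapy.Spider):
        name = "simplified"
        start_urls = ["http://app.internal.example:8000/private/"]

        # custom_settings is applied at "spider" priority, above settings.py.
        custom_settings = {
            "ROBOTSTXT_OBEY": False,
        }

        def parse(self, response):
            # Extraction logic is unchanged; only the robots.txt check is skipped.
            yield {"url": response.url, "title": response.css("title::text").get()}

Running scrapy crawl simplified with this spider skips the robots.txt check even if settings.py still has ROBOTSTXT_OBEY = True.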