Ignoring robots.txt can be useful when validating a Scrapy spider against staging environments, QA targets, or owned sites where crawl rules are intentionally restrictive. Disabling the rules keeps focus on parsing, request scheduling, and data extraction without being blocked by disallow patterns meant for public bots.
The robots.txt file lives at the site root and contains User-agent and Disallow directives that polite crawlers use to decide which paths to avoid. When ROBOTSTXT_OBEY is enabled, Scrapy's RobotsTxtMiddleware fetches /robots.txt once per domain and filters out any request whose path matches a Disallow rule for the configured user agent.
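The same decision can be reproduced outside Scrapy with Python's standard-library parser, which is handy for spot-checking a rule. A minimal sketch, assuming the example host and paths used later in this guide (urllib.robotparser is used here only for illustration; Scrapy uses its own configurable parser internally):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("http://app.internal.example:8000/robots.txt")
rp.read()  # download and parse the robots.txt file

# "User-agent: *" rules apply to any crawler, so /private/ paths are rejected
print(rp.can_fetch("simplifiedguide", "/private/data"))  # False
print(rp.can_fetch("simplifiedguide", "/index.html"))    # True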
Disabling ROBOTSTXT_OBEY prevents those rules from being applied, either temporarily for a single run via command-line settings or permanently via the project settings.py file. Skipping crawl rules can breach acceptable use policies and may trigger blocking or rate limiting, so only apply this to targets with explicit permission and conservative request rates.
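In that spirit, a disabled robots.txt check is usually paired with deliberately throttled settings. A hypothetical settings.py fragment along those lines (the values are illustrative, not tuned recommendations):

ROBOTSTXT_OBEY = False              # skip robots.txt filtering entirely
DOWNLOAD_DELAY = 1.0                # wait at least a second between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 2  # keep per-domain concurrency low
AUTOTHROTTLE_ENABLED = True         # back off automatically if the server slows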
$ curl http://app.internal.example:8000/robots.txt
User-agent: *
Disallow: /private/
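The runs below use a spider named simplified. Its code is not part of this guide, so the following is only a plausible sketch of such a spider pointed at the staging host above:

import scrapy

class SimplifiedSpider(scrapy.Spider):
    name = "simplified"
    start_urls = ["http://app.internal.example:8000/"]

    def parse(self, response):
        # Yield every link on the page as a simple item
        for href in response.css("a::attr(href)").getall():
            yield {"link": response.urljoin(href)}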
$ scrapy crawl simplified -s LOG_LEVEL=DEBUG
2026-01-01 08:29:20 [scrapy.utils.log] INFO: Scrapy 2.11.1 started (bot: simplifiedguide)
##### snipped #####
{'BOT_NAME': 'simplifiedguide',
'NEWSPIDER_MODULE': 'simplifiedguide.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['simplifiedguide.spiders']}
##### snipped #####
2026-01-01 08:29:20 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
##### snipped #####
2026-01-01 08:29:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://app.internal.example:8000/robots.txt> (referer: None)
$ scrapy crawl simplified --set=ROBOTSTXT_OBEY=False
2026-01-01 08:29:30 [scrapy.utils.log] INFO: Scrapy 2.11.1 started (bot: simplifiedguide)
##### snipped #####
2026-01-01 08:29:30 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'simplifiedguide',
'NEWSPIDER_MODULE': 'simplifiedguide.spiders',
'ROBOTSTXT_OBEY': 'False',
'SPIDER_MODULES': ['simplifiedguide.spiders']}
RobotsTxtMiddleware is not loaded when ROBOTSTXT_OBEY is disabled, so robots.txt rules are skipped for that run.
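The opt-out happens at startup: the middleware inspects the setting in its constructor and raises NotConfigured, which tells Scrapy to drop it from the middleware chain. A condensed sketch of that mechanism (details of the real middleware elided):

from scrapy.exceptions import NotConfigured

class RobotsTxtMiddleware:
    def __init__(self, crawler):
        # A falsy ROBOTSTXT_OBEY removes the middleware for the whole run
        if not crawler.settings.getbool("ROBOTSTXT_OBEY"):
            raise NotConfigured
        # ... the real middleware continues: fetch, cache, and apply rules ...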
Ignoring robots.txt without explicit permission can violate site terms and may trigger blocks such as HTTP 403 responses, CAPTCHAs, or IP bans.
$ vi simplifiedguide/settings.py
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
This change applies to all spiders in the project unless a spider overrides settings via custom_settings.
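For example, a single spider can keep obeying or ignoring robots.txt independently of the rest of the project. A hypothetical spider with such an override:

import scrapy

class StagingSpider(scrapy.Spider):
    name = "staging"
    start_urls = ["http://app.internal.example:8000/"]

    # Applies to this spider only, overriding the project-wide settings.py
    custom_settings = {"ROBOTSTXT_OBEY": False}

    def parse(self, response):
        yield {"title": response.css("title::text").get()}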
$ scrapy settings --get ROBOTSTXT_OBEY
False
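The same check works from Python, assuming it runs inside the project directory so scrapy.cfg is found:

from scrapy.utils.project import get_project_settings

settings = get_project_settings()
print(settings.getbool("ROBOTSTXT_OBEY"))  # prints False after the edit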