Ignoring robots.txt is useful when a Scrapy project needs to crawl approved staging paths, QA fixtures, or owned content that is intentionally blocked from public bots. Turning the check off lets a spider reach the same links and pages that its parser, item pipeline, and export tests depend on.
Scrapy enforces crawl rules through RobotsTxtMiddleware when ROBOTSTXT_OBEY is enabled. Current projects generated by scrapy startproject still set ROBOTSTXT_OBEY = True in settings.py, so the crawler requests /robots.txt and drops disallowed requests before the spider callback sees them.
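The check the middleware performs can be sketched with Python's standard-library urllib.robotparser, which applies the same kind of allow/disallow matching to a fetched robots.txt (Scrapy's default parser is Protego, selected by the ROBOTSTXT_PARSER setting, but it behaves equivalently for a simple rule). The robots.txt body and URLs below are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body that disallows a private path for all agents.
robots_body = "User-agent: *\nDisallow: /private/\n"

parser = RobotFileParser()
parser.parse(robots_body.splitlines())

# RobotsTxtMiddleware makes an equivalent allow/disallow decision
# before a request reaches the downloader.
print(parser.can_fetch("Scrapy", "http://example.com/private/page"))  # False
print(parser.can_fetch("Scrapy", "http://example.com/index.html"))    # True
```

A request that fails this check is what Scrapy reports as "Forbidden by robots.txt" and never hands to the spider callback.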
Disabling the setting affects every spider that uses the project settings module, so it should be limited to targets with explicit permission and conservative delay or concurrency settings. For a one-off crawl, pass -s ROBOTSTXT_OBEY=False on the command line instead of leaving the project default disabled after the test run.
Steps to ignore robots.txt for Scrapy spiders:
- Open the Scrapy project settings file from the project root.
$ vi tutorial/settings.py
In a standard project layout, the settings module is usually <project_name>/settings.py.
- Locate the ROBOTSTXT_OBEY line in settings.py.
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
- Change ROBOTSTXT_OBEY to False and save the file.
ROBOTSTXT_OBEY = False
If the setting is missing, add it once in the project settings module instead of duplicating it across several spiders.
Leaving the project default disabled makes every spider in the project ignore published crawl rules until the setting is reverted.
- Check the effective setting value that the current project loads.
$ scrapy settings --get ROBOTSTXT_OBEY
False
For a single crawl instead of a project-wide change, run scrapy crawl example -s ROBOTSTXT_OBEY=False.
- Run the spider with debug logging and confirm that RobotsTxtMiddleware is no longer in the downloader middleware list.
$ scrapy crawl example -s LOG_LEVEL=DEBUG
2026-04-16 06:16:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
##### snipped #####
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2026-04-16 06:16:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1:8000/> (referer: None)
2026-04-16 06:16:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1:8000/private/> (referer: http://127.0.0.1:8000/)
2026-04-16 06:16:20 [scrapy.core.scraper] DEBUG: Scraped from <200 http://127.0.0.1:8000/private/>
{'url': 'http://127.0.0.1:8000/private/', 'status': 200}
When robots.txt is still active, the log shows a request to /robots.txt and a Forbidden by robots.txt message instead of the private-page responses.
Mohd Shakir Zakaria is a cloud architect with deep roots in software development and open-source advocacy. Certified in AWS, Red Hat, VMware, ITIL, and Linux, he specializes in designing and managing robust cloud and on-premises infrastructures.
