Ignore robots.txt for Scrapy spiders

You'll notice that whenever you run a Scrapy spider, Scrapy first fetches the site's robots.txt file and then obeys whatever rules are defined in it.

$ scrapy crawl myspider
2018-06-19 12:05:10 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapyproject)
2018-06-19 12:05:10 [scrapy.middleware] INFO: Enabled downloader middlewares:
2018-06-19 12:05:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/robots.txt> (referer: None)
2018-06-19 12:05:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/> (referer: None)

You can make your spiders ignore robots.txt if you need to by following these steps:

  1. Open your project's Scrapy settings file in your favorite editor.
    $ vi scrapyproject/settings.py
  2. Look for the ROBOTSTXT_OBEY option.
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = True
  3. Set the value to False.
  4. Scrapy will no longer request robots.txt, and your spiders will crawl regardless of what the file defines.
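After the edit, the relevant lines in settings.py should read as follows (the comment matches the one generated by Scrapy's default project template):

```python
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
```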

You can also ignore robots.txt for a single run by passing the -s (--set) option on the command line. A setting passed this way overrides the value in settings.py.

To do so, set ROBOTSTXT_OBEY to False with -s, as in the following example:

$ scrapy crawl --help | grep -A1 set\=
  -s NAME=VALUE, --set=NAME=VALUE
                        set/override setting (may be repeated)
$ scrapy crawl -s ROBOTSTXT_OBEY='False' spidername