You'll notice that Scrapy checks for a robots.txt file first whenever you run your Scrapy spiders, and it then obeys whatever rules are defined in that robots.txt.
$ scrapy crawl myspider
2018-06-19 12:05:10 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapyproject)
---snipped---
2018-06-19 12:05:10 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
---snipped---
2018-06-19 12:05:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/robots.txt> (referer: None)
2018-06-19 12:05:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/> (referer: None)
---snipped---
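For reference, a minimal spider that could produce a log like the one above might look like the following sketch (the spider name myspider and the example.com start URL are assumptions taken from the log, not code from a specific project):

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://www.example.com/']

    def parse(self, response):
        # Yield the page title just to show the crawl ran.
        yield {'title': response.css('title::text').get()}

With the default settings, running scrapy crawl myspider fetches https://www.example.com/robots.txt before the first page request, as the log shows.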
You can ignore robots.txt for your spiders if you need to by following these steps:

Open Scrapy's configuration file in your project folder using your favorite editor.
$ vi scrapyproject/settings.py
Locate the ROBOTSTXT_OBEY setting:

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

Change its value to False:

ROBOTSTXT_OBEY = False
Scrapy should no longer check for robots.txt, and your spider will crawl regardless of what's defined in the robots.txt file.
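If you only want one spider to ignore robots.txt while the rest of the project keeps obeying it, Scrapy also lets you override settings per spider through the custom_settings class attribute. A minimal sketch, assuming the same hypothetical spider as above:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://www.example.com/']

    # Per-spider override: takes precedence over the project-wide
    # value in scrapyproject/settings.py.
    custom_settings = {'ROBOTSTXT_OBEY': False}

    def parse(self, response):
        yield {'url': response.url}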
You can also ignore robots.txt when crawling manually by using the -s option, which supersedes the previous method.
To manually ignore robots.txt for your spider, use the set option (-s) and set ROBOTSTXT_OBEY to False, as in the following example:
$ scrapy crawl --help | grep -A1 set\=
  --set=NAME=VALUE, -s NAME=VALUE
                        set/override setting (may be repeated)
$ scrapy crawl -s ROBOTSTXT_OBEY='False' spidername
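If you want to confirm which value actually took effect (the project setting or the -s override), the spider can log the final, merged setting at runtime. A small sketch, again assuming the hypothetical spider above:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://www.example.com/']

    def parse(self, response):
        # self.settings holds the merged settings, including any
        # -s NAME=VALUE overrides passed on the command line.
        self.logger.info('ROBOTSTXT_OBEY = %s',
                         self.settings.getbool('ROBOTSTXT_OBEY'))
        yield {'url': response.url}

Command-line settings passed with -s have the highest precedence in Scrapy, so they win over both custom_settings and settings.py.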