Website owners tell web spiders such as Googlebot what can and can't be crawled on their websites by using a robots.txt file. The file resides in the root directory of a website and contains rules such as the following:
User-agent: *
Disallow: /secret
Disallow: password.txt
A good web spider will first read the robots.txt file and adhere to its rules, though compliance is not actually compulsory.
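As a quick illustration (not part of the steps that follow), Python's standard urllib.robotparser module interprets Disallow rules the same way a well-behaved crawler would. The rules below are the sample ones from above; the URLs are purely illustrative.

from urllib.robotparser import RobotFileParser

# Sample rules from the example above, parsed directly without fetching anything.
rules = """\
User-agent: *
Disallow: /secret
Disallow: password.txt
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Paths under /secret are blocked for every user agent; everything else is allowed.
print(rp.can_fetch("Googlebot", "https://www.example.com/secret/page.html"))  # False
print(rp.can_fetch("Googlebot", "https://www.example.com/index.html"))        # True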
If you run a
scrapy crawl command for a project, it will indeed first look for the
robots.txt file and abide by all the rules.
$ scrapy crawl myspider
2018-06-19 12:05:10 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapyproject)
---snipped---
2018-06-19 12:05:10 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
---snipped---
2018-06-19 12:05:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/robots.txt> (referer: None)
2018-06-19 12:05:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/> (referer: None)
---snipped---
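For context, a crawl like the one above could come from a spider as minimal as the sketch below. The spider name myspider and the example.com start URL are taken from the log output; the parse logic is only illustrative.

import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = ["https://www.example.com/"]

    def parse(self, response):
        # Yield the page title as a trivial example item.
        yield {"title": response.css("title::text").get()}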
You can ignore robots.txt for your Scrapy spider by using the ROBOTSTXT_OBEY option and setting its value to False.
Running the scrapy crawl command for your project without any extra options uses the default behaviour and adheres to robots.txt:
$ scrapy crawl spidername
Use the --set option to set ROBOTSTXT_OBEY to False when crawling to ignore robots.txt:
$ scrapy crawl --set=ROBOTSTXT_OBEY='False' spidername
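If you prefer to keep the override with the spider's code rather than on the command line, Scrapy also supports a custom_settings class attribute. The sketch below is illustrative (the spider name and URL are assumptions), not part of the original steps:

import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = ["https://www.example.com/"]

    # Overrides the project-wide setting for this spider only.
    custom_settings = {
        "ROBOTSTXT_OBEY": False,
    }

    def parse(self, response):
        yield {"url": response.url}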
To disable the check for the whole project instead, open Scrapy's configuration file in your project folder using your favorite editor.
$ vi scrapyproject/settings.py
Look for the ROBOTSTXT_OBEY option, which is set to True by default:
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
Change the value to False:
ROBOTSTXT_OBEY = False
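If you want to double-check which value your project will actually use, a small script like the following (a sketch, assuming it is run from inside the project directory so scrapy.cfg and scrapyproject/settings.py are found) prints the effective setting:

from scrapy.utils.project import get_project_settings

# Loads the project's settings.py, including the ROBOTSTXT_OBEY value set above.
settings = get_project_settings()
print(settings.getbool("ROBOTSTXT_OBEY"))  # expected output: False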
Scrapy should no longer check for robots.txt, and your spider will crawl everything regardless of what's defined in the robots.txt file.