Website owners tell web spiders such as Googlebot what can and can't be crawled on their websites by using a robots.txt file. The file resides in the root directory of a website and contains rules such as the following:
User-agent: *
Disallow: /secret
Disallow: password.txt
A good web spider will first read the robots.txt file and adhere to its rules, though compliance is not actually compulsory.
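As a quick illustration (not part of the steps that follow), Python's standard urllib.robotparser module interprets Disallow rules the same way a well-behaved crawler would. The rules below are the sample ones from above; the URLs are purely illustrative.

from urllib.robotparser import RobotFileParser

# Sample rules from the example above, parsed directly without fetching anything.
rules = """\
User-agent: *
Disallow: /secret
Disallow: password.txt
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Paths under /secret are blocked for every user agent; everything else is allowed.
print(rp.can_fetch("Googlebot", "https://www.example.com/secret/page.html"))  # False
print(rp.can_fetch("Googlebot", "https://www.example.com/index.html"))        # True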
If you run a
scrapy crawl command for a project, it will indeed first look for the
robots.txt file and abide by all the rules.
$ scrapy crawl myspider
2018-06-19 12:05:10 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapyproject)
---snipped---
2018-06-19 12:05:10 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
---snipped---
2018-06-19 12:05:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/robots.txt> (referer: None)
2018-06-19 12:05:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/> (referer: None)
---snipped---
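For context, a crawl like the one above could come from a spider as minimal as the sketch below. The spider name myspider and the example.com start URL are taken from the log output; the parse logic is only illustrative.

import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = ["https://www.example.com/"]

    def parse(self, response):
        # Yield the page title as a trivial example item.
        yield {"title": response.css("title::text").get()}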
You can ignore robots.txt for your Scrapy spider by using the ROBOTSTXT_OBEY option and setting its value to False.
Running the scrapy crawl command for your project without any extra options uses the default behaviour and adheres to robots.txt:
$ scrapy crawl spidername
Use the --set option to set ROBOTSTXT_OBEY to False when crawling to ignore robots.txt:
$ scrapy crawl --set=ROBOTSTXT_OBEY='False' spidername
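If you prefer to keep the override with the spider's code rather than on the command line, Scrapy also supports a custom_settings class attribute. The sketch below is illustrative (the spider name and URL are assumptions), not part of the original steps:

import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = ["https://www.example.com/"]

    # Overrides the project-wide setting for this spider only.
    custom_settings = {
        "ROBOTSTXT_OBEY": False,
    }

    def parse(self, response):
        yield {"url": response.url}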
To disable the check for the whole project instead, open Scrapy's configuration file in your project folder using your favorite editor.
$ vi scrapyproject/settings.py
Look for the ROBOTSTXT_OBEY option, which is set to True by default:
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
Change the value to False:
ROBOTSTXT_OBEY = False
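If you want to double-check which value your project will actually use, a small script like the following (a sketch, assuming it is run from inside the project directory so scrapy.cfg and scrapyproject/settings.py are found) prints the effective setting:

from scrapy.utils.project import get_project_settings

# Loads the project's settings.py, including the ROBOTSTXT_OBEY value set above.
settings = get_project_settings()
print(settings.getbool("ROBOTSTXT_OBEY"))  # expected output: False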
Scrapy should no longer check for robots.txt, and your spider will crawl everything regardless of what's defined in the robots.txt file.