Website owners can tell web spiders and robots which parts of their site may and may not be crawled by using a
robots.txt file. The file resides in the root directory of a website and contains rules such as the following:
User-agent: *
Disallow: /secret
Disallow: /password.txt
A well-behaved web spider will first read the
robots.txt file and adhere to its rules, although compliance is voluntary and nothing enforces it.
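To illustrate what reading and adhering to the rules looks like in practice, the sketch below checks a few URLs against a small rule set using Python's standard-library urllib.robotparser. The RULES string, the is_allowed helper, and the "mybot" user agent are assumptions made for this example; it is not a description of how Scrapy performs the check internally.

```python
from urllib.robotparser import RobotFileParser

# Example rules, assumed for illustration (same shape as the sample above).
RULES = """\
User-agent: *
Disallow: /secret
Disallow: /password.txt
"""


def is_allowed(url, user_agent="mybot"):
    """Return True if the rules in RULES permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the lines of a robots.txt file, so no network access is needed.
    parser.parse(RULES.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://www.example.com/",
        "https://www.example.com/secret/page.html",
        "https://www.example.com/password.txt",
    ):
        print(url, "allowed" if is_allowed(url) else "disallowed")
```

A polite crawler would run a check like this before each request and skip any URL that is reported as disallowed.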
If you run the
scrapy crawl command for a project, Scrapy will first fetch the
robots.txt file and abide by all of its rules.
$ scrapy crawl myspider
2018-06-19 12:05:10 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapyproject)
---snipped---
2018-06-19 12:05:10 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
---snipped---
2018-06-19 12:05:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/robots.txt> (referer: None)
2018-06-19 12:05:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/> (referer: None)
---snipped---
You can make your
Scrapy spider ignore
robots.txt by setting the
ROBOTSTXT_OBEY option to False.
Steps to ignore robots.txt for Scrapy spiders:
1. Run the scrapy crawl command for your project; by default it adheres to the robots.txt rules.
$ scrapy crawl spidername
2. Use the --set option to set ROBOTSTXT_OBEY to False when crawling to ignore the robots.txt rules.
$ scrapy crawl --set=ROBOTSTXT_OBEY='False' spidername
3. Open Scrapy's configuration file in your project folder using your favorite editor.
$ vi scrapyproject/settings.py
4. Look for the ROBOTSTXT_OBEY option.
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
5. Set the value to False.
ROBOTSTXT_OBEY = False
6. Scrapy should no longer check for robots.txt, and your spider will crawl regardless of what is defined in the robots.txt file.