Website owners tell web spiders such as Googlebot what can and can't be crawled on their websites using a robots.txt file. The file resides in the root directory of a website and contains rules such as the following:
User-agent: *
Disallow: /secret
Disallow: /password.txt
A good web spider will first read the robots.txt file and adhere to its rules, though compliance is voluntary rather than enforced.
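To illustrate what a well-behaved crawler does, here is a minimal sketch (independent of Scrapy) that checks robots.txt rules with Python's standard urllib.robotparser module; the example.com URLs are placeholders, so the actual results depend on the rules that site really serves.

# Minimal sketch of a polite robots.txt check using the standard library.
# The example.com URLs are placeholders for illustration only.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()  # download and parse the rules

# can_fetch() reports whether the given user agent may crawl the given URL
print(rp.can_fetch("*", "https://www.example.com/"))
print(rp.can_fetch("*", "https://www.example.com/secret"))  # False if the site served the rules shown above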
If you run a scrapy crawl command for a project, Scrapy first fetches the site's robots.txt file and abides by its rules, as the log output below shows.
$ scrapy crawl myspider
2018-06-19 12:05:10 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapyproject)
---snipped---
2018-06-19 12:05:10 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
---snipped---
2018-06-19 12:05:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/robots.txt> (referer: None)
2018-06-19 12:05:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/> (referer: None)
---snipped---
You can make your Scrapy spider ignore robots.txt by setting the ROBOTSTXT_OBEY option to False, either on the command line or in the project's settings.py file.
Instead of the usual command:

$ scrapy crawl spidername

pass the setting on the command line:

$ scrapy crawl --set=ROBOTSTXT_OBEY='False' spidername
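If you launch Scrapy from a Python script rather than the command line, you can pass the same setting to CrawlerProcess. This is only a rough sketch; MySpider and the start URL are hypothetical placeholders, not part of the original project.

# Sketch: overriding ROBOTSTXT_OBEY when running Scrapy from a script.
# MySpider and the start URL are hypothetical placeholders.
from scrapy.crawler import CrawlerProcess
from scrapy.spiders import Spider

class MySpider(Spider):
    name = "myspider"
    start_urls = ["https://www.example.com/"]

    def parse(self, response):
        self.logger.info("Fetched %s", response.url)

# A standalone script does not read settings.py unless you load it explicitly,
# so the settings dict passed here is what the crawl will use.
process = CrawlerProcess(settings={"ROBOTSTXT_OBEY": False})
process.crawl(MySpider)
process.start()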
To change the default for the whole project, edit the settings file:

$ vi scrapyproject/settings.py
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
Change the value to False:

ROBOTSTXT_OBEY = False
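If you only want to skip the robots.txt check for one spider while keeping the project-wide default, Scrapy's custom_settings class attribute can override the setting per spider. A rough sketch, with the spider name and start URL assumed for illustration:

# Sketch: per-spider override via custom_settings; the spider name and
# start URL are placeholders.
import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = ["https://www.example.com/"]
    custom_settings = {
        "ROBOTSTXT_OBEY": False,  # overrides the project-wide value for this spider only
    }

    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}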