Website owners tell web spiders such as Googlebot what can and can't be crawled on their websites by means of a robots.txt file. The file resides in the root directory of a website and contains rules such as the following:
User-agent: *
Disallow: /secret
Disallow: /password.txt
A well-behaved web spider will read the robots.txt file first and adhere to its rules, though compliance is voluntary rather than enforced.
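As a rough sketch of what such a compliant client does, Python's standard urllib.robotparser module can fetch and evaluate the rules. The example.com URLs here are placeholders, assuming the robots.txt shown above:

from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt (placeholder URL)
rp = RobotFileParser("https://www.example.com/robots.txt")
rp.read()

# Check whether a given user agent may fetch a given URL
print(rp.can_fetch("*", "https://www.example.com/"))        # True: not disallowed
print(rp.can_fetch("*", "https://www.example.com/secret"))  # False: matches Disallow: /secret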
If you run the scrapy crawl command for a project, Scrapy will indeed fetch the robots.txt file first and abide by its rules:
$ scrapy crawl myspider
2018-06-19 12:05:10 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapyproject)
---snipped---
2018-06-19 12:05:10 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
---snipped---
2018-06-19 12:05:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/robots.txt> (referer: None)
2018-06-19 12:05:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/> (referer: None)
---snipped---
You can make your Scrapy spider ignore robots.txt by setting the ROBOTSTXT_OBEY setting to False.
Instead of the usual

$ scrapy crawl spidername

you can override the setting for a single run on the command line:

$ scrapy crawl --set=ROBOTSTXT_OBEY='False' spidername
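Scrapy also accepts -s as a shorthand for --set, so the following should be equivalent:

$ scrapy crawl -s ROBOTSTXT_OBEY=False spidername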
To make the change permanent for the whole project, edit the project's settings.py instead:

$ vi scrapyproject/settings.py
The default project template contains:

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

Change the value to False so every spider in the project skips the robots.txt check:

ROBOTSTXT_OBEY = False
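If only one spider should skip the check while the rest of the project keeps obeying, Scrapy also lets a spider override project settings through its custom_settings class attribute. A minimal sketch, with a hypothetical spider name and a placeholder URL:

import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"                            # hypothetical spider name
    start_urls = ["https://www.example.com/"]    # placeholder URL
    custom_settings = {"ROBOTSTXT_OBEY": False}  # overrides the project-wide setting

    def parse(self, response):
        self.log("Fetched %s" % response.url)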