Website owners can tell web spiders or robots what can and can't be crawled on their website by using a robots.txt file. The file resides in the root directory of the website and contains rules such as the following:

User-agent: *
Disallow: /secret
Disallow: /password.txt

A well-behaved web spider will first read the robots.txt file and adhere to its rules, though compliance is entirely voluntary.
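
To illustrate, here is a minimal sketch of how a polite client can check robots.txt before fetching a page, using Python's standard urllib.robotparser module. The domain and paths are placeholders taken from the sample rules above, not a real target.

from urllib import robotparser

# Download and parse the site's robots.txt file
rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# can_fetch() reports whether the given user agent may crawl a URL
print(rp.can_fetch("*", "https://www.example.com/"))        # True: not disallowed
print(rp.can_fetch("*", "https://www.example.com/secret"))  # False: matches Disallow: /secret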

When you run the scrapy crawl command for a project, Scrapy will indeed fetch the robots.txt file first and abide by its rules by default.

$ scrapy crawl myspider
2018-06-19 12:05:10 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapyproject)
---snipped---
2018-06-19 12:05:10 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
---snipped---
2018-06-19 12:05:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/robots.txt> (referer: None)
2018-06-19 12:05:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/> (referer: None)
---snipped---

You can make your Scrapy spider ignore robots.txt by using the ROBOTSTXT_OBEY setting.

Steps to ignore robots.txt for Scrapy spiders:

  1. Crawl a website normally using the scrapy crawl command for your project; by default the spider adheres to robots.txt rules.
    $ scrapy crawl spidername
  2. Set the ROBOTSTXT_OBEY option to False at crawl time to ignore robots.txt rules.
    $ scrapy crawl --set=ROBOTSTXT_OBEY='False' spidername
  3. Open your project's settings file (settings.py) using your favorite editor.
    $ vi scrapyproject/settings.py
  4. Look for the ROBOTSTXT_OBEY option.
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = True
  5. Set the value to False.
    ROBOTSTXT_OBEY = False
  6. Scrapy will no longer check robots.txt, and your spider will crawl regardless of what's defined in the file.
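
If you only need a single spider to ignore robots.txt while the rest of the project keeps obeying it, you can also override the setting in the spider class itself via custom_settings. Below is a minimal sketch; the spider name, domain, and parse logic are placeholders, not part of any particular project.

import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = ["https://www.example.com/"]

    # Per-spider settings take precedence over the project-wide settings.py
    custom_settings = {
        "ROBOTSTXT_OBEY": False,
    }

    def parse(self, response):
        # Placeholder parse logic: yield the page title
        yield {"title": response.css("title::text").get()}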