Website owners tell web spiders such as Googlebot what can and can't be crawled on their websites using a robots.txt file. The file resides in the root directory of a website and contains rules such as the following:

User-agent: *
Disallow: /secret
Disallow: /password.txt

A well-behaved web spider will first read the robots.txt file and adhere to its rules, though compliance is entirely voluntary.
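You can see the same matching logic at work with Python's standard-library urllib.robotparser module. A minimal sketch, checking the example rules above without any network access (the URLs are illustrative):

```python
from urllib.robotparser import RobotFileParser

# The example rules from above, fed to the parser directly
rules = """\
User-agent: *
Disallow: /secret
Disallow: /password.txt
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# Disallowed paths are rejected; everything else is allowed
print(parser.can_fetch("*", "https://www.example.com/secret"))        # False
print(parser.can_fetch("*", "https://www.example.com/password.txt"))  # False
print(parser.can_fetch("*", "https://www.example.com/about"))         # True
```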

When you run the scrapy crawl command for a project, Scrapy does indeed fetch the robots.txt file first and abide by its rules.

$ scrapy crawl myspider
2018-06-19 12:05:10 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapyproject)
---snipped---
2018-06-19 12:05:10 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
---snipped---
2018-06-19 12:05:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/robots.txt> (referer: None)
2018-06-19 12:05:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/> (referer: None)
---snipped---

You can make your Scrapy spider ignore robots.txt by setting the ROBOTSTXT_OBEY option to False.

Steps to ignore robots.txt for Scrapy spiders:

  1. Crawl a website normally using the scrapy crawl command for your project; by default, Scrapy adheres to robots.txt rules.
    $ scrapy crawl spidername
  2. Use the --set option to set ROBOTSTXT_OBEY to False when crawling, to ignore robots.txt rules for that run.
    $ scrapy crawl --set=ROBOTSTXT_OBEY='False' spidername
  3. Open your project's Scrapy settings file using your favorite editor.
    $ vi scrapyproject/settings.py
  4. Look for the ROBOTSTXT_OBEY option.
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = True
  5. Set the value to False.
    ROBOTSTXT_OBEY = False
  6. Scrapy will no longer check for robots.txt, and your spider will crawl everything regardless of what's defined in the robots.txt file.