Website owners tell web spiders such as Googlebot what can and can't be crawled on their websites using a robots.txt file. The file resides in the root directory of a website and contains rules such as the following:

User-agent: *
Disallow: /secret
Disallow: /password.txt

A good web spider will first read the robots.txt file and adhere to its rules, though compliance is voluntary rather than enforced.
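
Before fetching a page, a polite crawler checks whether the URL is allowed. Below is a minimal sketch of that check using Python's standard urllib.robotparser module; the user agent name and URLs are placeholders for illustration.

from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt file
rp = RobotFileParser("https://www.example.com/robots.txt")
rp.read()

# can_fetch() returns True only if the rules allow this user agent to crawl the URL
if rp.can_fetch("mybot", "https://www.example.com/secret/page.html"):
    print("Allowed to crawl")
else:
    print("Disallowed by robots.txt")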

If you run the scrapy crawl command for a project, Scrapy will first fetch the website's robots.txt file and abide by its rules.

$ scrapy crawl myspider
2018-06-19 12:05:10 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapyproject)
---snipped---
2018-06-19 12:05:10 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
---snipped---
2018-06-19 12:05:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/robots.txt> (referer: None)
2018-06-19 12:05:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/> (referer: None)
---snipped---
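
The myspider spider referenced in the command above could look something like the following minimal sketch; the spider name and start URL are illustrative assumptions, not taken from the log.

import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = ["https://www.example.com/"]

    def parse(self, response):
        # By the time this runs, RobotsTxtMiddleware has already fetched
        # robots.txt and filtered out disallowed requests.
        yield {"title": response.css("title::text").get()}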

You can make your Scrapy spider ignore robots.txt by setting the ROBOTSTXT_OBEY option to False.

Steps to ignore robots.txt for Scrapy spiders:

  1. Crawl a website normally using the scrapy crawl command for your project; by default, Scrapy adheres to robots.txt rules.
    $ scrapy crawl spidername
  2. Use the --set option to set ROBOTSTXT_OBEY to False when crawling to ignore robots.txt rules.
    $ scrapy crawl --set=ROBOTSTXT_OBEY=False spidername
  3. Open your Scrapy project's settings file using your favorite editor.
    $ vi scrapyproject/settings.py
  4. Look for the ROBOTSTXT_OBEY option.
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = True
  5. Set the value to False
    ROBOTSTXT_OBEY = False
  6. Scrapy will no longer check for robots.txt, and your spider will crawl everything regardless of what's defined in the robots.txt file. If you prefer not to change the project-wide default, see the per-spider sketch below.
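
If you only want a single spider to ignore robots.txt rather than the whole project, Scrapy's custom_settings attribute can override the project-wide value. The sketch below assumes a hypothetical spider named norobots; ROBOTSTXT_OBEY is the real setting discussed above.

import scrapy

class NoRobotsSpider(scrapy.Spider):
    name = "norobots"
    start_urls = ["https://www.example.com/secret"]

    # Per-spider settings take precedence over the project's settings.py
    custom_settings = {
        "ROBOTSTXT_OBEY": False,
    }

    def parse(self, response):
        yield {"url": response.url, "status": response.status}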