Website owners tell web spiders such as Googlebot what can and can't be crawled on their websites using a robots.txt file. The file resides in the root directory of a website and contains rules such as the following:
User-agent: *
Disallow: /secret
Disallow: password.txt
A well-behaved web spider will first read the robots.txt file and adhere to its rules, though doing so is not compulsory.
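To illustrate the check a polite crawler performs, the minimal sketch below uses Python's standard urllib.robotparser module against the example rules above; the host name, the "MyCrawler" user agent string and the URLs are placeholders, not part of Scrapy.

from urllib.robotparser import RobotFileParser

# The example rules from above, as a polite crawler would have
# fetched them from the site's /robots.txt.
robots_txt = """\
User-agent: *
Disallow: /secret
Disallow: password.txt
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Ask before fetching each URL; the host name is a placeholder.
for url in ("https://www.example.com/index.html",
            "https://www.example.com/secret/page.html"):
    allowed = parser.can_fetch("MyCrawler", url)
    print(url, "->", "allowed" if allowed else "disallowed")

Scrapy's RobotsTxtMiddleware performs an equivalent check for every request when robots.txt handling is enabled.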
Running the scrapy crawl command for a project will, by default, first fetch the robots.txt file and abide by all of its rules.
$ scrapy crawl simplified
2022-01-21 19:19:18 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: simplifiedguide)
##### snipped
{'BOT_NAME': 'simplifiedguide',
 'NEWSPIDER_MODULE': 'simplifiedguide.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['simplifiedguide.spiders']}
##### snipped
2022-01-21 19:19:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
##### snipped
2022-01-21 19:19:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.simplified.guide/robots.txt> (referer: None)
You can make your Scrapy spider ignore robots.txt by setting the ROBOTSTXT_OBEY option to False.
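If you want to check which value your project currently uses before changing anything, you can query it with Scrapy's settings command from inside the project directory (output not shown, since it depends on your project):

$ scrapy settings --get ROBOTSTXT_OBEY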
Steps to ignore robots.txt for Scrapy spiders:
- Open the robots.txt file of the website against which you want to test your Scrapy spider (optional).
$ curl https://www.simplified.guide/robots.txt
User-agent: *
Disallow: /*?do*
Disallow: /*?mode*
Disallow: /_detail/
Disallow: /_export/
Disallow: /talk/
Disallow: /wiki/
Disallow: /tag/
- Crawl the website normally using the scrapy crawl command for your project; by default, Scrapy adheres to the robots.txt rules.
$ scrapy crawl simplified
2022-01-21 19:27:28 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: simplifiedguide)
2022-01-21 19:27:28 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.7 (default, Sep 10 2021, 14:59:43) - [GCC 11.2.0], pyOpenSSL 20.0.1 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 3.3.2, Platform Linux-5.13.0-27-generic-aarch64-with-glibc2.34
2022-01-21 19:27:28 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-01-21 19:27:28 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'simplifiedguide',
 'NEWSPIDER_MODULE': 'simplifiedguide.spiders',
 'ROBOTSTXT_OBEY': True,
##### snipped
- Use the --set option to set ROBOTSTXT_OBEY to False when crawling to ignore the robots.txt rules.
$ scrapy crawl --set=ROBOTSTXT_OBEY='False' simplified
2022-01-21 19:28:33 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: simplifiedguide)
2022-01-21 19:28:33 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.7 (default, Sep 10 2021, 14:59:43) - [GCC 11.2.0], pyOpenSSL 20.0.1 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 3.3.2, Platform Linux-5.13.0-27-generic-aarch64-with-glibc2.34
2022-01-21 19:28:33 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-01-21 19:28:33 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'simplifiedguide',
 'NEWSPIDER_MODULE': 'simplifiedguide.spiders',
 'ROBOTSTXT_OBEY': 'False',
##### snipped
- Open your Scrapy project's settings file using your favorite editor.
$ vi simplifiedguide/settings.py
- Go to the ROBOTSTXT_OBEY option.
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
- Set the value to False.
ROBOTSTXT_OBEY = False
- Run the crawl command again without having to specify the ROBOTSTXT_OBEY option.
$ scrapy crawl simplified
2022-01-21 19:29:53 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: simplifiedguide)
2022-01-21 19:29:53 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.7 (default, Sep 10 2021, 14:59:43) - [GCC 11.2.0], pyOpenSSL 20.0.1 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 3.3.2, Platform Linux-5.13.0-27-generic-aarch64-with-glibc2.34
2022-01-21 19:29:53 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-01-21 19:29:53 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'simplifiedguide',
 'NEWSPIDER_MODULE': 'simplifiedguide.spiders',
 'SPIDER_MODULES': ['simplifiedguide.spiders']}
2022-01-21 19:29:53 [scrapy.extensions.telnet] INFO: Telnet Password: 1590cbf195f138f6
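If you only want a single spider to ignore robots.txt while the rest of the project keeps obeying it, Scrapy also honours a custom_settings class attribute on the spider itself. The sketch below is a minimal, hypothetical spider (its name, start URL and parsing logic are placeholders, not the spider used in the output above):

import scrapy

class IgnoreRobotsSpider(scrapy.Spider):
    # Hypothetical spider; the name and start URL are placeholders.
    name = "ignore_robots_example"
    start_urls = ["https://www.example.com/"]

    # Per-spider settings override the project-wide values in settings.py.
    custom_settings = {
        "ROBOTSTXT_OBEY": False,
    }

    def parse(self, response):
        # Minimal callback: record the URL and page title of each crawled page.
        yield {"url": response.url, "title": response.css("title::text").get()}

Run it with scrapy crawl as usual; only this spider skips the robots.txt check, while other spiders in the project still follow the project-wide ROBOTSTXT_OBEY value.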
Author: Mohd Shakir Zakaria
Mohd Shakir Zakaria is an experienced cloud architect with a strong development and open-source advocacy background. He boasts multiple certifications in AWS, Red Hat, VMware, ITIL, and Linux, underscoring his expertise in cloud architecture and system administration.