Website owners tell web spiders such as Googlebot what can and can't be crawled on their websites using a robots.txt file. The file resides in the root directory of a website and contains rules such as the following:
User-agent: *
Disallow: /secret
Disallow: password.txt
A well-behaved web spider will first read the robots.txt file and adhere to its rules, though doing so is not compulsory.
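To illustrate the check a polite crawler performs, the minimal sketch below uses Python's standard urllib.robotparser module against the example rules above; the host name, the "MyCrawler" user agent string and the URLs are placeholders, not part of Scrapy.

from urllib.robotparser import RobotFileParser

# The example rules from above, as a polite crawler would have
# fetched them from the site's /robots.txt.
robots_txt = """\
User-agent: *
Disallow: /secret
Disallow: password.txt
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Ask before fetching each URL; the host name is a placeholder.
for url in ("https://www.example.com/index.html",
            "https://www.example.com/secret/page.html"):
    allowed = parser.can_fetch("MyCrawler", url)
    print(url, "->", "allowed" if allowed else "disallowed")

Scrapy's RobotsTxtMiddleware performs an equivalent check for every request when robots.txt handling is enabled.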
Running the scrapy crawl command for a project will, by default, first fetch the robots.txt file and abide by all of its rules.
$ scrapy crawl simplified
2022-01-21 19:19:18 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: simplifiedguide)
##### snipped
{'BOT_NAME': 'simplifiedguide',
 'NEWSPIDER_MODULE': 'simplifiedguide.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['simplifiedguide.spiders']}
##### snipped
2022-01-21 19:19:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
##### snipped
2022-01-21 19:19:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.simplified.guide/robots.txt> (referer: None)
You can make your Scrapy spider ignore robots.txt by setting the ROBOTSTXT_OBEY option to False.
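If you want to check which value your project currently uses before changing anything, you can query it with Scrapy's settings command from inside the project directory (output not shown, since it depends on your project):

$ scrapy settings --get ROBOTSTXT_OBEY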
Steps to ignore robots.txt for Scrapy spiders:
- Open the robots.txt file of the website against which you want to test your Scrapy spider (optional).
$ curl https://www.simplified.guide/robots.txt
User-agent: *
Disallow: /*?do*
Disallow: /*?mode*
Disallow: /_detail/
Disallow: /_export/
Disallow: /talk/
Disallow: /wiki/
Disallow: /tag/
- Crawl the website normally using the scrapy crawl command for your project; by default, Scrapy adheres to the robots.txt rules.
$ scrapy crawl simplified
2022-01-21 19:27:28 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: simplifiedguide)
2022-01-21 19:27:28 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.7 (default, Sep 10 2021, 14:59:43) - [GCC 11.2.0], pyOpenSSL 20.0.1 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 3.3.2, Platform Linux-5.13.0-27-generic-aarch64-with-glibc2.34
2022-01-21 19:27:28 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-01-21 19:27:28 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'simplifiedguide',
 'NEWSPIDER_MODULE': 'simplifiedguide.spiders',
 'ROBOTSTXT_OBEY': True,
##### snipped
- Use the --set option to set ROBOTSTXT_OBEY to False when crawling to ignore the robots.txt rules.
$ scrapy crawl --set=ROBOTSTXT_OBEY='False' simplified
2022-01-21 19:28:33 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: simplifiedguide)
2022-01-21 19:28:33 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.7 (default, Sep 10 2021, 14:59:43) - [GCC 11.2.0], pyOpenSSL 20.0.1 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 3.3.2, Platform Linux-5.13.0-27-generic-aarch64-with-glibc2.34
2022-01-21 19:28:33 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-01-21 19:28:33 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'simplifiedguide',
 'NEWSPIDER_MODULE': 'simplifiedguide.spiders',
 'ROBOTSTXT_OBEY': 'False',
##### snipped
- Open your Scrapy project's settings file using your favorite editor.
$ vi simplifiedguide/settings.py
- Go to the ROBOTSTXT_OBEY option.
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
- Set the value to False.
ROBOTSTXT_OBEY = False
- Run the crawl command again without having to specify the ROBOTSTXT_OBEY option.
$ scrapy crawl simplified
2022-01-21 19:29:53 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: simplifiedguide)
2022-01-21 19:29:53 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.7 (default, Sep 10 2021, 14:59:43) - [GCC 11.2.0], pyOpenSSL 20.0.1 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 3.3.2, Platform Linux-5.13.0-27-generic-aarch64-with-glibc2.34
2022-01-21 19:29:53 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-01-21 19:29:53 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'simplifiedguide',
 'NEWSPIDER_MODULE': 'simplifiedguide.spiders',
 'SPIDER_MODULES': ['simplifiedguide.spiders']}
2022-01-21 19:29:53 [scrapy.extensions.telnet] INFO: Telnet Password: 1590cbf195f138f6
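If you only want a single spider to ignore robots.txt while the rest of the project keeps obeying it, Scrapy also honours a custom_settings class attribute on the spider itself. The sketch below is a minimal, hypothetical spider (its name, start URL and parsing logic are placeholders, not the spider used in the output above):

import scrapy

class IgnoreRobotsSpider(scrapy.Spider):
    # Hypothetical spider; the name and start URL are placeholders.
    name = "ignore_robots_example"
    start_urls = ["https://www.example.com/"]

    # Per-spider settings override the project-wide values in settings.py.
    custom_settings = {
        "ROBOTSTXT_OBEY": False,
    }

    def parse(self, response):
        # Minimal callback: record the URL and page title of each crawled page.
        yield {"url": response.url, "title": response.css("title::text").get()}

Run it with scrapy crawl as usual; only this spider skips the robots.txt check, while other spiders in the project still follow the project-wide ROBOTSTXT_OBEY value.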
Author: Mohd Shakir Zakaria
Mohd Shakir Zakaria is an experienced cloud architect with a strong development and open-source advocacy background. He boasts multiple certifications in AWS, Red Hat, VMware, ITIL, and Linux, underscoring his expertise in cloud architecture and system administration.