Website owners tell web spiders such as Googlebot what can and can't be crawled on their websites using the robots.txt file. The file resides in the root directory of a website and contains rules such as the following:

User-agent: *
Disallow: /secret
Disallow: /password.txt

A well-behaved web spider will first read the robots.txt file and adhere to its rules, though compliance is voluntary and not enforced by the server.
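
You can reproduce the check a polite crawler performs with Python's standard urllib.robotparser module. The following is a minimal sketch, assuming the example rules above are served at https://www.example.com/robots.txt:

from urllib.robotparser import RobotFileParser

# Download and parse the site's robots.txt file.
rp = RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# Ask whether a generic user agent ("*") may fetch each URL.
print(rp.can_fetch("*", "https://www.example.com/"))        # True: not disallowed
print(rp.can_fetch("*", "https://www.example.com/secret"))  # False under the rules above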

If you run the scrapy crawl command for a project, Scrapy will first fetch the website's robots.txt file and abide by its rules, since projects generated by scrapy startproject set the ROBOTSTXT_OBEY option to True by default.

$ scrapy crawl myspider
2018-06-19 12:05:10 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapyproject)
---snipped---
2018-06-19 12:05:10 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
---snipped---
2018-06-19 12:05:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/robots.txt> (referer: None)
2018-06-19 12:05:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/> (referer: None)
---snipped---
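
If the spider then requests a URL that the robots.txt rules disallow, RobotsTxtMiddleware filters the request out before it is sent. The debug line below is illustrative, reusing the example rules from the start of this article:

2018-06-19 12:05:11 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://www.example.com/secret>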

You can make your Scrapy spider ignore robots.txt by setting the ROBOTSTXT_OBEY option to False, either on the command line or in the project's settings.py file.
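
If only one spider in a project should ignore robots.txt, the override can also live in the spider itself through Scrapy's per-spider custom_settings attribute, which takes precedence over the project's settings.py. A minimal sketch, with a placeholder spider name and domain:

import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = ["https://www.example.com/"]

    # Applies only to this spider; other spiders in the project
    # keep whatever ROBOTSTXT_OBEY value settings.py defines.
    custom_settings = {
        "ROBOTSTXT_OBEY": False,
    }

    def parse(self, response):
        self.log(response.url)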

Steps to ignore robots.txt for Scrapy spiders:

  1. View the robots.txt file of the website that you want to test your Scrapy spider against (optional).
    $ curl https://www.simplified.guide/robots.txt
    User-agent: *
    Disallow: /*?do*
    Disallow: /*?mode*
    Disallow: /_detail/
    Disallow: /_export/
    Disallow: /talk/
    Disallow: /wiki/
    Disallow: /tag/
  2. Crawl the website normally using the scrapy crawl command for your project; by default, the spider adheres to the robots.txt rules.
    $ scrapy crawl simplifiedguide
    2022-01-09 06:22:41 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: simplifiedguide)
    2022-01-09 06:22:41 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.9.7 (default, Sep 10 2021, 14:59:43) - [GCC 11.2.0], pyOpenSSL 20.0.1 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 3.3.2, Platform Linux-5.13.0-23-generic-x86_64-with-glibc2.34
    2022-01-09 06:22:41 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
    2022-01-09 06:22:41 [scrapy.crawler] INFO: Overridden settings:
    {'BOT_NAME': 'simplifiedguide',
     'NEWSPIDER_MODULE': 'simplifiedguide.spiders',
     'ROBOTSTXT_OBEY': True,
    ##### snipped
  3. Use the --set option to set ROBOTSTXT_OBEY to False when crawling to ignore the robots.txt rules for that run.
    $ scrapy crawl --set=ROBOTSTXT_OBEY='False' simplifiedguide
    2021-12-22 07:52:27 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: simplifiedguide)
    2021-12-22 07:52:27 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.9.7 (default, Sep 10 2021, 14:59:43) - [GCC 11.2.0], pyOpenSSL 20.0.1 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 3.3.2, Platform Linux-5.13.0-22-generic-x86_64-with-glibc2.34
    2021-12-22 07:52:27 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
    2021-12-22 07:52:27 [scrapy.crawler] INFO: Overridden settings:
    {'BOT_NAME': 'simplifiedguide',
     'NEWSPIDER_MODULE': 'simplifiedguide.spiders',
     'ROBOTSTXT_OBEY': 'False',
    ##### snipped
  4. Open your Scrapy project's settings.py file using your favorite editor.
    $ vi simplifiedguide/settings.py
  5. Go to the ROBOTSTXT_OBEY option.
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = True
  6. Set the value to False.
    ROBOTSTXT_OBEY = False
  7. Run the crawl command again without having to specify the ROBOTSTXT_OBEY option (a script-based alternative is sketched after these steps).
    $ scrapy crawl simplifiedguide
    2021-12-22 07:54:23 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: simplifiedguide)
    2021-12-22 07:54:23 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.9.7 (default, Sep 10 2021, 14:59:43) - [GCC 11.2.0], pyOpenSSL 20.0.1 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 3.3.2, Platform Linux-5.13.0-22-generic-x86_64-with-glibc2.34
    2021-12-22 07:54:23 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
    2021-12-22 07:54:23 [scrapy.crawler] INFO: Overridden settings:
    {'BOT_NAME': 'simplifiedguide',
     'NEWSPIDER_MODULE': 'simplifiedguide.spiders',
     'SPIDER_MODULES': ['simplifiedguide.spiders']}
    2021-12-22 07:54:23 [scrapy.extensions.telnet] INFO: Telnet Password: f0602803840ef7ab
    ##### snipped
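
If you run your spider from a script instead of the scrapy crawl command, the same override can be applied programmatically. A minimal sketch using Scrapy's CrawlerProcess and get_project_settings, assuming it is run from the project directory so that the simplifiedguide spider can be found:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Load the project's settings.py, then override ROBOTSTXT_OBEY for this run only.
settings = get_project_settings()
settings.set("ROBOTSTXT_OBEY", False)

process = CrawlerProcess(settings)
process.crawl("simplifiedguide")  # spider name, resolved by the project's spider loader
process.start()                   # blocks until the crawl finishes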
