Website owners use a robots.txt file to tell web spiders such as Googlebot what can and can't be crawled on their websites. The file resides in the root directory of a website and contains rules such as the following:

User-agent: *
Disallow: /secret
Disallow: /password.txt

A well-behaved web spider reads the robots.txt file first and adheres to its rules, though compliance is voluntary.
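To see how a spider might apply such rules, here is a minimal sketch using Python's standard urllib.robotparser module. The rule set, the example.com URLs, the examplebot user agent, and the is_allowed helper are all assumptions made for illustration. Scrapy uses its own robots.txt handling, so treat this as a demonstration of the idea rather than of Scrapy's internals.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules in the style of the sample above, with every path
# written from the site root (leading slash).
RULES = """\
User-agent: *
Disallow: /secret
Disallow: /password.txt
"""


def is_allowed(url: str, user_agent: str = "examplebot") -> bool:
    """Return True if RULES permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the robots.txt content as an iterable of lines.
    parser.parse(RULES.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://example.com/index.html",
        "https://example.com/secret/plans.html",
        "https://example.com/password.txt",
    ):
        print(url, "->", "allowed" if is_allowed(url) else "blocked")
```

Rules are matched as path prefixes here, so Disallow: /secret also covers everything below /secret/.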

Running the scrapy crawl command for a project fetches the robots.txt file first and abides by its rules, because projects generated with scrapy startproject set ROBOTSTXT_OBEY to True in settings.py.

$ scrapy crawl simplified
2022-01-21 19:19:18 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: simplifiedguide)
##### snipped
{'BOT_NAME': 'simplifiedguide',
 'NEWSPIDER_MODULE': 'simplifiedguide.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['simplifiedguide.spiders']}
##### snipped
2022-01-21 19:19:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
##### snipped
2022-01-21 19:19:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.simplified.guide/robots.txt> (referer: None)
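The last log line shows robots.txt being requested from the root of the site before any other page. As a rough sketch of how that location follows from any page URL, the helper below keeps the scheme and host and replaces the rest with /robots.txt. The function name robots_txt_url and the sample page path are hypothetical, and this is not Scrapy's own code.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Build the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host; drop the page's path, query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    # The page path below is made up for illustration.
    print(robots_txt_url("https://www.simplified.guide/some/page?do=edit"))
```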

You can make your Scrapy spider ignore robots.txt by setting the ROBOTSTXT_OBEY option to False.

Steps to ignore robots.txt for Scrapy spiders:

  1. View the robots.txt file of the website that you want to crawl with your Scrapy spider (optional).
    $ curl https://www.simplified.guide/robots.txt
    User-agent: *
    Disallow: /*?do*
    Disallow: /*?mode*
    Disallow: /_detail/
    Disallow: /_export/
    Disallow: /talk/
    Disallow: /wiki/
    Disallow: /tag/
  2. Crawl the website using the scrapy crawl command for your project; by default, the project adheres to the robots.txt rules.
    $ scrapy crawl simplified
    2022-01-21 19:27:28 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: simplifiedguide)
    2022-01-21 19:27:28 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.7 (default, Sep 10 2021, 14:59:43) - [GCC 11.2.0], pyOpenSSL 20.0.1 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 3.3.2, Platform Linux-5.13.0-27-generic-aarch64-with-glibc2.34
    2022-01-21 19:27:28 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
    2022-01-21 19:27:28 [scrapy.crawler] INFO: Overridden settings:
    {'BOT_NAME': 'simplifiedguide',
     'NEWSPIDER_MODULE': 'simplifiedguide.spiders',
     'ROBOTSTXT_OBEY': True,
    ##### snipped
  3. Use the --set option to set ROBOTSTXT_OBEY to False when crawling to ignore robots.txt rules.
    $ scrapy crawl --set=ROBOTSTXT_OBEY='False' simplified
    2022-01-21 19:28:33 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: simplifiedguide)
    2022-01-21 19:28:33 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.7 (default, Sep 10 2021, 14:59:43) - [GCC 11.2.0], pyOpenSSL 20.0.1 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 3.3.2, Platform Linux-5.13.0-27-generic-aarch64-with-glibc2.34
    2022-01-21 19:28:33 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
    2022-01-21 19:28:33 [scrapy.crawler] INFO: Overridden settings:
    {'BOT_NAME': 'simplifiedguide',
     'NEWSPIDER_MODULE': 'simplifiedguide.spiders',
     'ROBOTSTXT_OBEY': 'False',
  4. Open Scrapy's configuration file in your project folder using your favorite editor.
    $ vi simplifiedguide/settings.py
  5. Go to the ROBOTSTXT_OBEY option.
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = True
  6. Set the value to False.
    ROBOTSTXT_OBEY = False
  7. Run the crawl command again without specifying the ROBOTSTXT_OBEY option.
    $ scrapy crawl simplified
    2022-01-21 19:29:53 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: simplifiedguide)
    2022-01-21 19:29:53 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.7 (default, Sep 10 2021, 14:59:43) - [GCC 11.2.0], pyOpenSSL 20.0.1 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 3.3.2, Platform Linux-5.13.0-27-generic-aarch64-with-glibc2.34
    2022-01-21 19:29:53 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
    2022-01-21 19:29:53 [scrapy.crawler] INFO: Overridden settings:
    {'BOT_NAME': 'simplifiedguide',
     'NEWSPIDER_MODULE': 'simplifiedguide.spiders',
     'SPIDER_MODULES': ['simplifiedguide.spiders']}
    2022-01-21 19:29:53 [scrapy.extensions.telnet] INFO: Telnet Password: 1590cbf195f138f6
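One detail in step 3 may look odd: the log prints 'ROBOTSTXT_OBEY': 'False' as a string, and in plain Python any non-empty string is truthy. Scrapy converts boolean settings when it reads them (its Settings.getbool method accepts values such as 'False' and '0'), so the command-line form still disables the robots.txt check. The sketch below is a simplified stand-in for that conversion, written only to illustrate the behaviour; to_bool is a hypothetical name and this is not Scrapy's actual code.

```python
def to_bool(value) -> bool:
    """Interpret a setting value the way a boolean setting is usually read.

    Accepts real booleans, 0/1, and the strings '0', '1', 'True', 'true',
    'False' and 'false'. Anything else raises ValueError.
    """
    try:
        # Handles True/False, 0/1 and numeric strings such as '0' or '1'.
        return bool(int(value))
    except ValueError:
        if value in ("True", "true"):
            return True
        if value in ("False", "false"):
            return False
        raise ValueError(f"Unsupported boolean value: {value!r}")


if __name__ == "__main__":
    # bool('False') alone would be True, because the string is non-empty.
    print("naive bool('False'):", bool("False"))
    print("to_bool('False'):   ", to_bool("False"))
```

With this kind of conversion, --set=ROBOTSTXT_OBEY='False' on the command line and ROBOTSTXT_OBEY = False in settings.py end up meaning the same thing.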