Ignoring the /robots.txt file in Wget enables full‑site crawling and archiving, including paths that polite crawlers normally skip. This behavior is often required for internal compliance audits, offline mirrors, and controlled testing of how sensitive content is exposed.
Under the Robots Exclusion Protocol, web servers publish a /robots.txt file at the site root, and clients that implement the protocol adjust which URLs they fetch based on its user-agent rules. When recursive retrieval is enabled, Wget honours these rules: it first requests /robots.txt and then filters the links it follows, unless the internal robots variable is explicitly disabled through command-line options or configuration files.
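For reference, a site's published rules might look like the hypothetical output below; the disallowed paths are purely illustrative and the real file, if any, differs per site.

$ wget -qO- https://www.example.com/robots.txt
User-agent: *
Disallow: /private/
Disallow: /tmp/

A compliant recursive crawl skips every URL matched by a Disallow rule for its user agent, and that skipping is exactly what the robots variable controls.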
Disabling this safeguard bypasses the site owner’s published crawling preferences and can add significant load or violate acceptable‑use policies, even if technically possible. The commands below assume access is authorised and focus on Wget running in a shell on Linux, showing both a one‑off override and a configuration change that permanently turns off robots handling for a single user.
Steps to ignore robots.txt in Wget:
- Open a terminal on Linux with standard user privileges.
$ whoami
user

Running Wget as an unprivileged account reduces the blast radius if an unexpected path is fetched or a misconfiguration causes excessive downloads.
- Ignore /robots.txt for a single recursive crawl by passing the robots variable on the command line.
$ wget --execute=robots=off --recursive https://www.example.com/
--2025-01-01 12:00:00--  https://www.example.com/
Resolving www.example.com (www.example.com)... 93.184.216.34
Connecting to www.example.com (www.example.com)|93.184.216.34|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1256 (1.2K) [text/html]
Saving to: ‘index.html’

index.html          100%[====================]   1.23K  --.-KB/s    in 0.01s

##### snipped #####
The --execute=robots=off option sets the internal robots variable for this invocation only while leaving global configuration unchanged.
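The same override is commonly written with the short option -e, which is equivalent to --execute:

$ wget -e robots=off --recursive https://www.example.com/
##### snipped #####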
- Enable persistent ignoring of /robots.txt by adding a robots setting to the per‑user configuration file.
$ printf 'robots = off\n' >> ~/.wgetrc
Permanently disabling robots handling for a user can breach site policies, increase load on fragile servers, and may trigger IP‑level blocking or legal complaints from administrators.
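One way to soften that impact, assuming it suits the target server, is to keep throttling directives next to the robots setting in the same file; the values below are only a sketch and should be tuned per site.

$ printf 'wait = 1\nlimit_rate = 200k\n' >> ~/.wgetrc

The wait directive pauses between retrievals and limit_rate caps bandwidth, which helps keep a permanently robots-off configuration from overloading smaller hosts.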
- Confirm that the robots setting is present in the per‑user configuration.
$ grep -i '^robots' ~/.wgetrc
robots = off
If multiple robots entries exist in ~/.wgetrc, the final line is the effective value that Wget applies during downloads.
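As an illustration, a file that has been edited more than once might hold both values, in which case only the trailing entry takes effect; the contents shown here are hypothetical.

$ grep -i '^robots' ~/.wgetrc
robots = on
robots = off

Removing the stale robots = on line keeps the configuration unambiguous, even though Wget already applies the later robots = off.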
- Verify that recursive downloads now ignore /robots.txt without specifying the execute option explicitly.
$ wget --recursive https://www.example.com/
--2025-01-01 12:05:00--  https://www.example.com/
Resolving www.example.com (www.example.com)... 93.184.216.34
Connecting to www.example.com (www.example.com)|93.184.216.34|:443... connected.
HTTP request sent, awaiting response... 200 OK
##### snipped #####
Successful access to URLs that are disallowed in the site’s /robots.txt, subject to any additional authentication or IP restrictions, indicates that robot exclusion is no longer being honoured for this configuration.
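As a quick spot check, assuming the site's /robots.txt disallows a hypothetical /private/ path, the mirrored tree should now contain files under it.

$ ls www.example.com/private/
index.html
##### snipped #####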
