Recursive wget jobs normally stop at paths excluded by robots.txt, which is the right default for mirrors, audits, and bulk retrievals that should stay polite. When an authorized capture has to include those paths, a temporary robots override lets the download proceed without redesigning the rest of the command.
During recursive retrieval wget fetches robots.txt once per host and uses it to skip disallowed paths; it also honors document-level nofollow hints. The -e robots=off form disables both robot-exclusion checks for a single run, while robots = off in $HOME/.wgetrc makes the override persistent.
The override should stay narrowly scoped because it can expand both crawl depth and request volume very quickly. Keep recursion bounded, use only approved hosts or paths, and remove any persistent override as soon as the special capture is complete.
Steps to ignore robots.txt in wget:
- Fetch the published robots policy first so the excluded paths are documented before the override.
$ wget -qO- https://archive.example.net/robots.txt
User-agent: *
Disallow: /exports/internal/
Capturing the baseline policy makes it obvious which URLs the default crawl would skip.
Related: How to mirror an entire website with wget
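For the audit record, the saved policy can also be reduced to just its Disallow prefixes. This is a minimal sketch, assuming the policy was saved to a local robots.txt; the sample contents written below simply mirror the example output above.

```shell
# Sample of the saved policy; in practice it would come from:
#   wget -qO robots.txt https://archive.example.net/robots.txt
cat > robots.txt <<'EOF'
User-agent: *
Disallow: /exports/internal/
EOF

# Keep only the Disallow path prefixes for the audit record.
awk -F': *' '$1 == "Disallow" && $2 != "" { print $2 }' robots.txt
# → /exports/internal/
```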
- Run the bounded recursive job once without an override and confirm that disallowed paths stay out of the result tree.
$ wget -r -l 1 --no-parent https://archive.example.net/exports/
Loading robots.txt; please ignore errors.
Saving to: 'archive.example.net/exports/index.html'
Saving to: 'archive.example.net/robots.txt'
Saving to: 'archive.example.net/exports/public/status.html'
Downloaded: 3 files
A baseline run makes the override easier to audit because the missing paths are known before anything changes.
- Re-run the same command with -e robots=off and verify that previously blocked paths are now downloaded.
$ wget -r -l 1 --no-parent -e robots=off https://archive.example.net/exports/
Saving to: 'archive.example.net/exports/index.html'
Saving to: 'archive.example.net/exports/public/status.html'
Saving to: 'archive.example.net/exports/internal/report.html'
Downloaded: 3 files
The -e robots=off form applies to this command only, which makes it the safest choice for one audit or mirror job.
- Persist the override in the user startup file only when the same approved job must repeat under the same account.
~/.wgetrc
robots = off
A persistent override changes every later recursive wget run for that user until it is removed or reset.
Related: How to configure default options in ~/.wgetrc
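Before a repeat run it can help to confirm whether the startup file still carries the override. A small sketch, assuming only the startup-file lookup wget documents: the WGETRC environment variable first, then ~/.wgetrc.

```shell
# Report whether robot exclusion is disabled persistently for this user.
# Follows wget's documented startup-file lookup: $WGETRC, then ~/.wgetrc.
RCFILE="${WGETRC:-$HOME/.wgetrc}"
if grep -Eq '^[[:space:]]*robots[[:space:]]*=[[:space:]]*off' "$RCFILE" 2>/dev/null; then
    echo "persistent override active"
else
    echo "default robots handling"
fi
```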
- Confirm the active behavior from the downloaded tree before handing the result to another system.
$ find archive.example.net -type f | sort
archive.example.net/exports/index.html
archive.example.net/exports/internal/report.html
archive.example.net/exports/public/status.html
If a path that was disallowed in the original policy exists in the mirror tree, the override was active for that run.
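That check can be scripted against the mirror tree. A sketch, assuming the host directory and the disallowed prefix from the example run; any output means the override was active.

```shell
# Flag mirrored files that fall under a prefix the published policy
# disallowed. HOSTDIR and PREFIX are taken from the example capture.
HOSTDIR="archive.example.net"
PREFIX="/exports/internal/"
find "$HOSTDIR" -type f 2>/dev/null | while IFS= read -r f; do
    rel="/${f#"$HOSTDIR/"}"   # path relative to the host root
    case "$rel" in
        "$PREFIX"*) echo "override fetched: $f" ;;
    esac
done
```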
- Restore the normal crawler policy as soon as the special-case download is complete.
~/.wgetrc
robots = on
Removing the override or setting it back to on keeps later recursive downloads polite by default.
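One way to reset a persistent override is to rewrite the line in place. A sketch, assuming GNU sed (for -i) and the same WGETRC/~/.wgetrc lookup as above.

```shell
# Rewrite a persistent "robots = off" back to "on" (GNU sed -i).
# Follows wget's startup-file lookup: $WGETRC first, then ~/.wgetrc.
RCFILE="${WGETRC:-$HOME/.wgetrc}"
if [ -f "$RCFILE" ]; then
    sed -i 's/^[[:space:]]*robots[[:space:]]*=[[:space:]]*off[[:space:]]*$/robots = on/' "$RCFILE"
fi
```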
Mohd Shakir Zakaria is a cloud architect with deep roots in software development and open-source advocacy. Certified in AWS, Red Hat, VMware, ITIL, and Linux, he specializes in designing and managing robust cloud and on-premises infrastructures.
