How to ignore robots.txt in wget

Recursive wget downloads respect a site's published robot rules by default, which is the right behavior for ordinary mirrors, audits, and offline copies. When you have explicit permission to fetch paths that the site blocks from crawlers, you can turn that check off for one bounded run.

GNU wget controls this behavior through the robots setting. The command-line form --execute robots=off disables the robots.txt check for the current command only, while robots = off in ~/.wgetrc makes the override persistent for that account. The same setting also makes wget ignore document-level nofollow directives during recursive retrieval.

Use the override narrowly. Keep the start URL, recursion depth, and output directory explicit, and restore the default policy when the approved capture is finished so later recursive jobs do not keep bypassing crawl rules.
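
One way to keep those three bounds explicit is to build the command from named arguments and refuse to proceed when any of them is missing. The following is a minimal sketch, not part of wget itself: the function name bounded_capture_cmd and the output directory name are hypothetical, and the function only prints the command so it can be reviewed before it is run. It assumes the arguments contain no spaces.

```shell
# Hypothetical helper: print (do not run) a bounded wget command that
# carries the robots override. Arguments: start URL, depth, output directory.
bounded_capture_cmd() {
    url=$1
    depth=$2
    outdir=$3
    # Refuse to build a command when any bound is missing.
    if [ -z "$url" ] || [ -z "$depth" ] || [ -z "$outdir" ]; then
        echo "usage: bounded_capture_cmd URL DEPTH OUTDIR" >&2
        return 1
    fi
    # --directory-prefix keeps the capture inside one known directory.
    printf 'wget --recursive --level=%s --no-parent --directory-prefix=%s --execute robots=off %s\n' \
        "$depth" "$outdir" "$url"
}
```

Calling bounded_capture_cmd https://archive.example.net/exports/ 1 capture prints a single wget line that can be inspected, logged with the approval record, and then executed by hand.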

Steps to ignore robots.txt in wget:

Review the published crawler policy before bypassing it.
```
$ wget --quiet --output-document=- https://archive.example.net/robots.txt
User-agent: *
Disallow: /exports/internal/
```
This confirms which path the default recursive run will refuse to fetch.
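
For a longer robots.txt, the blocked paths can be pulled out with a small filter. This is a rough sketch under stated assumptions: the function name list_disallowed is hypothetical, and it prints every Disallow path regardless of which User-agent group it belongs to, which is a simplification of how wget actually evaluates the file.

```shell
# Hypothetical helper: read a robots.txt on stdin and print each Disallow path.
# Simplification: User-agent groups are ignored, so all groups are listed together.
list_disallowed() {
    awk '
        { sub(/#.*/, "") }                       # drop trailing comments
        tolower($0) ~ /^[ \t]*disallow[ \t]*:/ {
            sub(/^[^:]*:[ \t]*/, "")             # remove the "Disallow:" prefix
            sub(/[ \t\r]+$/, "")                 # trim trailing whitespace and CR
            if ($0 != "") print                  # an empty Disallow blocks nothing
        }
    '
}
```

It can be fed from the same command shown above, for example by piping wget --quiet --output-document=- https://archive.example.net/robots.txt into list_disallowed.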

Run the recursive download once with the default policy and confirm that only the allowed files are saved.

```
$ wget --recursive --level=1 --no-parent https://archive.example.net/exports/
Saving to: 'archive.example.net/exports/index.html'
Loading robots.txt; please ignore errors.
Saving to: 'archive.example.net/robots.txt'
Saving to: 'archive.example.net/exports/public/status.html'
Downloaded: 3 files, 472 in 0s
```

The baseline run makes the later override easy to audit because the blocked path is still absent.

Re-run the same command with --execute robots=off and confirm that the previously blocked path is now downloaded.

```
$ wget --recursive --level=1 --no-parent --execute robots=off https://archive.example.net/exports/
Saving to: 'archive.example.net/exports/index.html'
Saving to: 'archive.example.net/exports/public/status.html'
Saving to: 'archive.example.net/exports/internal/report.html'
Downloaded: 3 files, 588 in 0s
```

Use this only for approved captures. The short form -e robots=off is equivalent.

Persist the override in ~/.wgetrc only when the same approved recursive job must repeat under the same account.
```
# ~/.wgetrc
robots = off
```
This changes every later recursive wget run for that user until you remove it or set it back to on.
Check the downloaded tree before you hand the result to another system or user.
```
$ ls archive.example.net/exports/internal
report.html
```
A file in the blocked directory confirms that the override was active for that run.
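
When the capture is larger than one file, comparing the baseline tree with the override tree shows exactly what the override added. The sketch below assumes the two runs were saved into separate directories (for example with wget's --directory-prefix option), which the commands above do not do by default. The function name only_in_override is hypothetical.

```shell
# Hypothetical helper: list files present under the override tree ($2)
# but absent from the baseline tree ($1). Paths are printed relative to each root.
only_in_override() {
    baseline=$1
    override=$2
    tmp_base=$(mktemp) || return 1
    tmp_over=$(mktemp) || { rm -f "$tmp_base"; return 1; }
    # Build relative, sorted file lists so the trees compare path by path.
    (cd "$baseline" && find . -type f | LC_ALL=C sort) > "$tmp_base"
    (cd "$override" && find . -type f | LC_ALL=C sort) > "$tmp_over"
    # comm -13 keeps only the lines unique to the second list.
    LC_ALL=C comm -13 "$tmp_base" "$tmp_over"
    rm -f "$tmp_base" "$tmp_over"
}
```

Every path it prints was fetched only because the robots check was off, so the list doubles as an audit record for the approved capture.
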
Restore the default crawler policy as soon as the special-case download is complete.
```
# ~/.wgetrc
robots = on
```
Remove the override entirely if you do not want any saved robots setting in the user profile.
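
Removing the setting can be done by filtering the file rather than editing it by hand. This is a minimal sketch with a hypothetical function name, strip_robots_setting. It matches the setting name case-insensitively and leaves comment lines alone, but it does not account for alternate spellings of the name that use hyphens or underscores.

```shell
# Hypothetical helper: print the wgetrc file at $1 without any "robots = ..." lines.
strip_robots_setting() {
    # Comment lines start with "#", so they never match the anchored pattern.
    # grep exits with 1 when nothing is left to print; treat that as success.
    grep -v -i -E '^[[:space:]]*robots[[:space:]]*=' "$1" || [ $? -eq 1 ]
}
```

Because the function writes to standard output, the result can be reviewed first and then moved into place, for example with strip_robots_setting ~/.wgetrc > ~/.wgetrc.new followed by mv ~/.wgetrc.new ~/.wgetrc.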