Recursive wget jobs normally stop at paths excluded by robots.txt, which is the right default for mirrors, audits, and bulk retrievals that should stay polite. When an authorized capture has to include those paths, a temporary robots override lets the download proceed without redesigning the rest of the command.

During recursive retrieval wget fetches robots.txt once per host and uses it to skip disallowed paths; it also honors document-level nofollow hints. The -e robots=off switch disables both robot-exclusion checks for a single run, while robots = off in $HOME/.wgetrc makes the override persistent.

The override should stay narrowly scoped because it can expand both crawl depth and request volume very quickly. Keep recursion bounded, use only approved hosts or paths, and remove any persistent override as soon as the special capture is complete.

Steps to ignore robots.txt in wget:

  1. Fetch the published robots policy first so the excluded paths are documented before the override.
    $ wget -qO- https://archive.example.net/robots.txt
    User-agent: *
    Disallow: /exports/internal/

    Capturing the baseline policy makes it obvious which URLs the default crawl would skip. Related: How to mirror an entire website with wget
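    The captured policy can be reduced to just its excluded path prefixes with standard text tools. A minimal sketch, assuming the policy was saved locally as robots.txt (the here-doc below stands in for the fetch in this step):

    ```shell
    # Stand-in for the robots.txt fetched above.
    cat > robots.txt <<'EOF'
    User-agent: *
    Disallow: /exports/internal/
    EOF

    # Print the second field of each Disallow line: the excluded prefixes.
    awk '/^Disallow:/ {print $2}' robots.txt
    ```

    These prefixes are exactly the paths to check against the mirror tree after the override run.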

  2. Run the bounded recursive job once without an override and confirm that disallowed paths stay out of the result tree.
    $ wget -r -l 1 --no-parent https://archive.example.net/exports/
    Loading robots.txt; please ignore errors.
    Saving to: 'archive.example.net/exports/index.html'
    Saving to: 'archive.example.net/robots.txt'
    Saving to: 'archive.example.net/exports/public/status.html'
    Downloaded: 3 files

    A baseline run makes the override easier to audit because the missing paths are known before anything changes.

  3. Re-run the same command with -e robots=off and verify that previously blocked paths are now downloaded.
    $ wget -r -l 1 --no-parent -e robots=off https://archive.example.net/exports/
    Saving to: 'archive.example.net/exports/index.html'
    Saving to: 'archive.example.net/exports/public/status.html'
    Saving to: 'archive.example.net/exports/internal/report.html'
    Downloaded: 3 files

    The -e robots=off form applies to this command only, which makes it the safest choice for one audit or mirror job.
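    Even with the override active, the run can stay polite by pacing requests. The sketch below assembles a rate-limited variant of the step-3 command using standard wget options (--wait, --random-wait, --limit-rate); the host is the example host from this article, so treat the assembled line as a template rather than something to run verbatim:

    ```shell
    # Build the bounded, rate-limited variant of the override command as a
    # string so the flags can be reviewed before the job is launched.
    cmd="wget -r -l 1 --no-parent -e robots=off \
      --wait=1 --random-wait --limit-rate=200k \
      https://archive.example.net/exports/"
    printf '%s\n' "$cmd"
    ```

    Pacing matters most here because robots=off removes the server operator's own crawl limits, so the client has to supply its own.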

  4. Persist the override in the user startup file only when the same approved job must repeat under the same account.
    ~/.wgetrc
    robots = off

    A persistent override changes every later recursive wget run for that user until it is removed or reset. Related: How to configure default options in ~/.wgetrc
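    Appending the setting blindly can leave duplicate lines after repeated runs; one way to keep the startup file clean is to update an existing robots line in place and append only when none is present. This sketch works on a demo copy rather than the real ~/.wgetrc, so it is safe to try as-is (GNU sed -i is assumed):

    ```shell
    rcfile=./wgetrc.demo            # stand-in for ~/.wgetrc
    touch "$rcfile"

    # Replace an existing robots line, or append one if none is present.
    if grep -q '^robots' "$rcfile"; then
      sed -i 's/^robots.*/robots = off/' "$rcfile"
    else
      echo 'robots = off' >> "$rcfile"
    fi
    cat "$rcfile"
    ```

    Running the snippet twice still leaves exactly one robots line, which keeps the later cleanup step unambiguous.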

  5. Confirm the active behavior from the downloaded tree before handing the result to another system.
    $ find archive.example.net -type f | sort
    archive.example.net/exports/index.html
    archive.example.net/exports/internal/report.html
    archive.example.net/exports/public/status.html

    If a path that was disallowed in the original policy exists in the mirror tree, the override was active for that run.
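    The same comparison can be scripted: list each Disallow prefix from the captured policy and test whether any mirrored file falls under it. A minimal sketch, assuming the policy was saved as robots.txt and the mirror lives under archive.example.net/ (the sample files below stand in for a real download):

    ```shell
    # Recreate the sample policy and mirror tree from the steps above.
    mkdir -p archive.example.net/exports/internal archive.example.net/exports/public
    : > archive.example.net/exports/index.html
    : > archive.example.net/exports/internal/report.html
    : > archive.example.net/exports/public/status.html
    printf 'User-agent: *\nDisallow: /exports/internal/\n' > robots.txt

    # For each disallowed prefix, report mirrored files beneath it; any
    # hit means the robots override was active for that run.
    awk '/^Disallow:/ {print $2}' robots.txt | while read -r prefix; do
      find "archive.example.net$prefix" -type f 2>/dev/null
    done
    ```

    An empty result means the mirror respected the policy; any listed file is evidence the override was in effect.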

  6. Restore the normal crawler policy as soon as the special-case download is complete.
    ~/.wgetrc
    robots = on

    Removing the override or setting it back to on keeps later recursive downloads polite by default.
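    Deleting the line restores wget's built-in default (robots on), which is equivalent to setting robots = on explicitly. A sketch that again operates on a demo copy of the startup file, leaving unrelated settings untouched (GNU sed -i is assumed):

    ```shell
    rcfile=./wgetrc.demo                # stand-in for ~/.wgetrc
    printf 'robots = off\ntries = 3\n' > "$rcfile"

    # Drop the override line; other settings are preserved.
    sed -i '/^robots[ ]*=/d' "$rcfile"
    cat "$rcfile"
    ```

    After the deletion only the unrelated tries setting remains, so the next recursive run honors robots.txt again without any further flags.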