A full website mirror is useful when a site needs to remain readable offline, preserved before a migration, or reviewed without repeatedly hitting the origin server. For static or mostly static sites, wget can capture the HTML plus its dependent assets in one reproducible run.

Website mirroring builds on recursive retrieval, but it adds the options that make the result browsable from disk. --mirror is the shortcut for -N -r -l inf --no-remove-listing, while --page-requisites collects referenced assets such as images and stylesheets, and --convert-links rewrites eligible internal links so the local copy works without an active network connection.

Mirroring an entire site needs stricter boundaries than downloading one directory. Crawl policy, allowed domains, pacing, and the site's rendering model should be reviewed before the live run, especially when the source is public, large, rate limited, or heavily driven by client-side JavaScript. Dynamic sessions and authenticated application flows often need a different capture method.

Steps to mirror an entire website with wget:

  1. Create the local mirror root and move into it before starting discovery.
    $ mkdir -p ~/mirrors/docs.example.net
    $ cd ~/mirrors/docs.example.net
    $ pwd
    /home/user/mirrors/docs.example.net

    Keeping each site in its own root directory makes repeated captures and comparison work much easier.
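One way to make run-to-run comparison concrete is to give each capture its own dated subdirectory under the site root. This is only a suggested convention, not something wget requires; the relative `mirrors/` path below stands in for the `~/mirrors` layout used above.

```shell
# Suggested layout: one dated capture directory per run under the site root.
# (Hypothetical convention; wget itself does not require this.)
run=$(date +%Y-%m-%d)
mkdir -p mirrors/docs.example.net/"$run"

# Each entry listed here is a separate capture that can be diffed later.
ls mirrors/docs.example.net
```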

  2. Review the site's crawl policy before running a broad mirror.
    $ wget --output-document=review-robots.txt https://docs.example.net/robots.txt
    --2026-03-29 09:21:15--  https://docs.example.net/robots.txt
    Resolving docs.example.net (docs.example.net)... 203.0.113.50
    Connecting to docs.example.net (docs.example.net)|203.0.113.50|:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 91 [text/plain]
    Saving to: 'review-robots.txt'
    
    2026-03-29 09:21:15 (339 KB/s) - 'review-robots.txt' saved [91/91]
    
    $ sed -n '1,6p' review-robots.txt
    User-agent: *
    Allow: /docs/
    Disallow: /private/
    Sitemap: https://docs.example.net/sitemap.xml

    Ignoring a published crawl policy can trigger bans or legal complaints, and a site that deliberately blocks automated traversal will leave the mirror silently incomplete.
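A quick way to act on the downloaded policy is to test candidate paths against its Disallow prefixes before widening the crawl. The sketch below is deliberately simplified (it handles plain prefix rules only, not wildcards or per-agent groups, and assumes rule paths contain no whitespace); the here-doc is a stand-in for the review-robots.txt fetched in step 2.

```shell
# Stand-in for the review-robots.txt downloaded in step 2.
cat > review-robots.txt <<'EOF'
User-agent: *
Allow: /docs/
Disallow: /private/
EOF

# Simplified prefix check: mark the path blocked if any Disallow rule
# is a prefix of it. Real robots matching also handles wildcards and
# per-agent groups; this covers only the common prefix case.
path="/private/reports/index.html"
blocked=no
for prefix in $(awk '/^Disallow:/ {print $2}' review-robots.txt); do
  case "$path" in "$prefix"*) blocked=yes ;; esac
done
echo "path=$path blocked=$blocked"
```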

  3. Use spider mode first to validate the scope before committing to large transfers. Recursive spidering still fetches pages to discover their links, but it keeps nothing on disk.
    $ wget --spider --recursive --level=3 --no-parent \
      --domains=docs.example.net \
      https://docs.example.net/
    Spider mode enabled. Check if remote file exists.
    --2026-03-29 09:21:15--  https://docs.example.net/
    Resolving docs.example.net (docs.example.net)... 203.0.113.50
    Connecting to docs.example.net (docs.example.net)|203.0.113.50|:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Remote file exists and could contain links to other resources -- retrieving.
    
    --2026-03-29 09:21:15--  https://docs.example.net/docs/index.html
    HTTP request sent, awaiting response... 200 OK
    Remote file exists and could contain links to other resources -- retrieving.
    
    --2026-03-29 09:21:15--  https://docs.example.net/assets/site.css
    HTTP request sent, awaiting response... 200 OK
    Remote file exists and could contain links to other resources -- retrieving.
    
    Found no broken links.

    A dry run is the fastest way to catch a bad start URL or overly broad recursion depth before large transfers begin.
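The dry run is easier to review as a flat URL list than as a scrolling transcript. Adding `-o spider.log` to the spider command writes the transcript to a file; the here-doc below is a small stand-in transcript so the extraction step can be shown offline.

```shell
# Stand-in for a transcript written with:  wget --spider ... -o spider.log
cat > spider.log <<'EOF'
--2026-03-29 09:21:15--  https://docs.example.net/
HTTP request sent, awaiting response... 200 OK
--2026-03-29 09:21:15--  https://docs.example.net/docs/index.html
--2026-03-29 09:21:15--  https://docs.example.net/docs/index.html
EOF

# Pull every in-scope URL out of the log and deduplicate it; the result
# is the exact set of pages the full mirror will visit.
grep -oE 'https://docs\.example\.net[^ ]*' spider.log | sort -u > scope-urls.txt
cat scope-urls.txt
```

Reviewing scope-urls.txt before step 4 catches stray paths (admin panels, calendar traps, parent directories) while they are still cheap to exclude.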

  4. Run the full mirror with link conversion, page requisites, and pacing enabled.
    $ wget --mirror --convert-links --adjust-extension --page-requisites \
      --no-parent --domains=docs.example.net \
      --wait=1 --random-wait --limit-rate=250k \
      https://docs.example.net/
    --2026-03-29 09:21:31--  https://docs.example.net/
    Resolving docs.example.net (docs.example.net)... 203.0.113.50
    Connecting to docs.example.net (docs.example.net)|203.0.113.50|:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 391 [text/html]
    Saving to: 'docs.example.net/index.html'
    
    --2026-03-29 09:21:31--  https://docs.example.net/robots.txt
    HTTP request sent, awaiting response... 200 OK
    Length: 91 [text/plain]
    Saving to: 'docs.example.net/robots.txt'
    
    --2026-03-29 09:21:32--  https://docs.example.net/assets/site.css
    HTTP request sent, awaiting response... 200 OK
    Length: 63 [text/css]
    Saving to: 'docs.example.net/assets/site.css'
    
    --2026-03-29 09:21:34--  https://docs.example.net/docs/index.html
    HTTP request sent, awaiting response... 200 OK
    Length: 240 [text/html]
    Saving to: 'docs.example.net/docs/index.html'
    
    --2026-03-29 09:21:35--  https://docs.example.net/docs/overview.html
    HTTP request sent, awaiting response... 200 OK
    Length: 251 [text/html]
    Saving to: 'docs.example.net/docs/overview.html'
    
    FINISHED --2026-03-29 09:21:35--
    Total wall clock time: 4.8s
    Downloaded: 6 files, 1.2K in 0s (2.62 MB/s)
    Converted links in 4 files in 0.001 seconds.

    Option                   Purpose
    --mirror                 Shortcut for --timestamping --recursive --level=inf --no-remove-listing.
    --page-requisites        Downloads assets needed to render mirrored HTML pages correctly.
    --convert-links          Rewrites eligible internal links for local browsing.
    --domains                Defines the allowed host list for redirects and host-spanning retrievals.
    --wait / --random-wait   Adds pacing so the mirror behaves more politely.

  5. Verify that the mirrored tree contains both HTML pages and dependent assets.
    $ find docs.example.net -maxdepth 4 -type f | sort
    docs.example.net/assets/site.css
    docs.example.net/docs/index.html
    docs.example.net/docs/overview.html
    docs.example.net/image/logo.svg
    docs.example.net/index.html
    docs.example.net/robots.txt

    A useful mirror needs the page files and the assets they reference, not just the starting HTML document.
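A step beyond listing files is confirming that each page's relative references actually resolve on disk. The sketch below builds a small stand-in tree mimicking the step 5 layout so the check is runnable as shown; against a real mirror, run the same loop from the mirror root. It assumes plain relative links without query strings or fragments.

```shell
# Stand-in tree mimicking the mirrored layout from step 5.
mkdir -p docs.example.net/assets docs.example.net/docs
printf '<link href="assets/site.css"><a href="docs/index.html">Docs</a>\n' \
  > docs.example.net/index.html
: > docs.example.net/assets/site.css
: > docs.example.net/docs/index.html

# Every relative href target in the page should exist as a local file;
# anything reported as missing was referenced but never downloaded.
missing=0
for ref in $(grep -oE 'href="[^"]+"' docs.example.net/index.html | cut -d'"' -f2); do
  [ -f "docs.example.net/$ref" ] || { echo "missing: $ref"; missing=1; }
done
echo "missing=$missing"
```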

  6. Confirm that internal links were rewritten for offline use.
    $ grep -o 'href="[^"]*"' docs.example.net/index.html | head -n 3
    href="assets/site.css"
    href="docs/index.html"
    href="docs/overview.html"

    Relative local paths like these confirm that the mirror can be browsed from disk without contacting the original site.
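The inverse check is also worth running: any absolute link back to the origin that survives conversion usually points at content outside the crawl scope, which --convert-links deliberately leaves untouched. The stand-in page below contains only relative links, so the check reports a clean mirror; run the same grep against the real mirror root.

```shell
# Stand-in mirrored page containing only relative links.
mkdir -p docs.example.net
printf '<a href="docs/overview.html">Overview</a>\n' > docs.example.net/index.html

# Remaining absolute origin URLs mark references that were not rewritten,
# typically because their targets fell outside the mirrored scope.
if grep -rq 'https://docs\.example\.net' docs.example.net; then
  echo "unconverted absolute links remain"
else
  echo "no absolute links to the origin"
fi
```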