Downloading all public content from a website into local storage enables offline browsing, long-term study, and reproducible access even when the original server is slow or unreachable. A complete mirror also simplifies searching across documentation sets, training material, or reference sites without repeatedly hitting the remote host.

The wget downloader can operate as a simple crawler, following internal HTTP and HTTPS links, fetching page prerequisites such as images and style sheets, and rewriting saved links so that navigation works from disk instead of the network. Recursive options like --mirror, --convert-links, and --page-requisites turn a single command into a structured snapshot of the site hierarchy.

Mirroring must respect /robots.txt directives, terms of use, and the capacity of the origin server, especially for large or media-heavy sites. Aggressive recursion, high concurrency, or unrestricted bandwidth can overload smaller hosts and trigger rate limits or bans. Conservative defaults, bandwidth caps, and explicit domain restrictions keep the process safer for both the mirror and the origin.

Steps to mirror a website with wget:

  1. Create a directory on local storage to hold the website mirror.
    $ mkdir -p ~/mirrors/example.com

    Using a dedicated directory keeps the mirror separate from other files and simplifies cleanup or repeated runs.

  2. Change to the mirror directory before starting the download.
    $ cd ~/mirrors/example.com

  3. Inspect the site's /robots.txt file to see how automated clients are expected to behave.
    $ curl --silent --show-error https://www.example.com/robots.txt
    User-agent: *
    Disallow: /private/
    Allow: /

    Ignoring /robots.txt or documented rate limits can cause excessive load, account suspension, or IP blocking by the site operator.
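    The Disallow rules can also feed directly into the mirror command. The sketch below, using a local copy of the sample robots.txt shown above, collects the disallowed paths into a comma-separated list suitable for wget's --exclude-directories flag:

    # Sketch: turn Disallow rules from a saved robots.txt into a value for
    # wget's --exclude-directories flag. The sample file reproduces the curl
    # output above; in a real run, fetch the live robots.txt first.
    cat > /tmp/robots.txt <<'EOF'
    User-agent: *
    Disallow: /private/
    Allow: /
    EOF

    # Collect the paths after "Disallow:" into a comma-separated list.
    EXCLUDES=$(grep -i '^Disallow:' /tmp/robots.txt | awk '{print $2}' | paste -sd, -)
    echo "$EXCLUDES"   # prints /private/
    # A mirror run could then add: --exclude-directories="$EXCLUDES"
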

  4. Run wget in recursive mirror mode against the site root from the mirror directory.
    $ wget --mirror --convert-links --adjust-extension --page-requisites --no-parent https://www.example.com/
    --2025-12-21 08:12:41--  https://www.example.com/
    Resolving www.example.com (www.example.com)... 172.17.0.10
    Connecting to www.example.com (www.example.com)|172.17.0.10|:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 578 [text/html]
    Saving to: ‘www.example.com/index.html’
     
    ##### snipped #####
    FINISHED --2025-12-21 08:12:41--
    Downloaded: 17 files, 2.0M in 0.005s (424 MB/s)

    Option               Description
    --mirror             Enables recursive mirroring with sensible defaults
                         (shorthand for -r -N -l inf --no-remove-listing).
    --convert-links      Rewrites links in saved pages so navigation works from
                         the local mirror.
    --adjust-extension   Appends the correct suffix to saved files (for example,
                         .html for HTML responses whose URLs lack it).
    --page-requisites    Downloads images, style sheets, and other assets
                         required to render pages correctly.
    --no-parent          Prevents traversal above the requested path, keeping
                         the mirror within the intended subtree.

  5. Add polite throttling flags for larger or slower sites to reduce impact on the origin.
    $ wget --mirror --convert-links --adjust-extension --page-requisites --no-parent \
      --wait=1 --limit-rate=200k https://www.example.com/

    Very fast or parallel downloads can resemble abusive traffic and may disrupt smaller servers or trigger automatic blocks; throttling reduces this risk at the cost of longer mirror times.
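    The cost of --wait is easy to estimate in advance. A back-of-envelope sketch, using a made-up page count rather than anything measured from example.com:

    # Rough arithmetic: how much time --wait=1 adds to a mirror run.
    # PAGES is a hypothetical count, not a property of the example site.
    PAGES=500
    WAIT=1
    EXTRA=$((PAGES * WAIT))
    echo "$EXTRA seconds (~$((EXTRA / 60)) minutes) of added delay"
    # prints: 500 seconds (~8 minutes) of added delay

    For media-heavy sites the --limit-rate cap usually dominates instead, since large files spend most of their time transferring rather than waiting between requests.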

  6. Confirm that the mirror contains an entry point and supporting assets in the expected directory tree.
    $ find . -maxdepth 3 -type f | sort | head
    ./www.example.com/index.html
    ./www.example.com/about/index.html
    ./www.example.com/contact/index.html
    ./www.example.com/css/styles.css
    ./www.example.com/data/image01.jpg
    ./www.example.com/data/index.html
    ./www.example.com/data/report.pdf
    ./www.example.com/docs/asset.css
    ./www.example.com/docs/guide.html
    ./www.example.com/docs/index.html
    ./www.example.com/files/archive.tar.gz

    Presence of index.html pages, style sheets, and images in the mirrored tree is a strong indicator that recursion and page prerequisites were captured correctly.

  7. Open the mirrored entry page in a local browser from the mirror directory.
    $ cd www.example.com
    $ xdg-open index.html

    On macOS, use open index.html; on Windows, use start index.html from PowerShell or cmd to launch the default browser.
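    Some pages behave better when served over HTTP than when opened via file:// URLs. A sketch of serving the mirror from a local web server instead, assuming python3 is installed; the temporary directory here stands in for the real mirror tree, and the port number is arbitrary:

    # Serve a directory over localhost with Python's built-in server
    # (--directory requires Python 3.7+). /tmp/mirror-demo stands in for
    # the mirrored www.example.com directory.
    mkdir -p /tmp/mirror-demo
    printf '<h1>hi</h1>' > /tmp/mirror-demo/index.html
    python3 -m http.server 8731 --directory /tmp/mirror-demo &
    SRV=$!
    sleep 1
    BODY=$(curl -s http://localhost:8731/index.html)
    echo "$BODY"
    kill "$SRV"

    Browsing http://localhost:8731/ then reads only from disk while still giving pages a proper HTTP origin.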

  8. Verify offline usability by browsing a few internal links while the browser reads only from disk.
    $ grep -m 5 "href=" index.html
      <link rel="stylesheet" href="css/styles.css">
        <li><a href="about/index.html">About</a></li>
        <li><a href="contact/index.html">Contact</a></li>
        <li><a href="data/index.html">Data directory</a></li>
        <li><a href="docs/index.html">Docs directory</a></li>

    Relative links such as about/index.html indicate that wget successfully converted absolute URLs into paths that work from the local mirror without contacting the remote server.
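    The inverse check is also worth running: any href that still names the remote host is a link --convert-links did not rewrite. A self-contained sketch, where the sample file stands in for a page from the real mirror:

    # Flag hrefs that still point at a remote host after conversion.
    # The sample page below deliberately contains one unconverted link.
    mkdir -p /tmp/linkcheck
    cat > /tmp/linkcheck/index.html <<'EOF'
    <a href="about/index.html">About</a>
    <a href="https://www.example.com/missed.html">Unconverted</a>
    EOF
    grep -rn 'href="https\?://' /tmp/linkcheck || echo "no absolute links remain"

    A clean mirror prints "no absolute links remain"; any matches are candidates for a follow-up fetch or a deliberate external reference.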
