A full website mirror is useful when a site needs to remain readable offline, be preserved before a migration, or be reviewed without repeatedly hitting the origin server. For static or mostly static sites, wget can capture the HTML and its dependent assets in one reproducible run.
Website mirroring builds on recursive retrieval but adds the options that make the result browsable from disk. --mirror is a shortcut for -N -r -l inf --no-remove-listing, --page-requisites collects referenced assets such as images and stylesheets, and --convert-links rewrites eligible internal links so the local copy works without an active network connection.
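As a reference sketch using the example host from this guide, the expanded command below behaves the same as wget --mirror --page-requisites --convert-links:

$ wget -N -r -l inf --no-remove-listing \
    --page-requisites --convert-links https://docs.example.net/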
Mirroring an entire site needs stricter boundaries than downloading one directory. Crawl policy, allowed domains, pacing, and the site's rendering model should be reviewed before the live run, especially when the source is public, large, rate limited, or heavily driven by client-side JavaScript. Dynamic sessions and authenticated application flows often need a different capture method.
Steps to mirror an entire website with wget:
- Create the local mirror root and move into it before starting discovery.
$ mkdir -p ~/mirrors/docs.example.net
$ cd ~/mirrors/docs.example.net
$ pwd
/home/user/mirrors/docs.example.net
Keeping each site in its own root directory makes repeated captures and comparison work much easier.
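For example, assuming two dated copies of the same mirror root are kept side by side (the directory names here are illustrative), a recursive diff shows what changed between captures:

$ diff -rq docs.example.net-2026-03-01 docs.example.net-2026-03-29   # illustrative dated copies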
- Review the site's crawl policy before running a broad mirror.
$ wget --output-document=review-robots.txt https://docs.example.net/robots.txt
--2026-03-29 09:21:15--  https://docs.example.net/robots.txt
Resolving docs.example.net (docs.example.net)... 203.0.113.50
Connecting to docs.example.net (docs.example.net)|203.0.113.50|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 91 [text/plain]
Saving to: 'review-robots.txt'

2026-03-29 09:21:15 (339 KB/s) - 'review-robots.txt' saved [91/91]

$ sed -n '1,6p' review-robots.txt
User-agent: *
Allow: /docs/
Disallow: /private/
Sitemap: https://docs.example.net/sitemap.xml
Ignoring published crawl policy can trigger bans, legal complaints, or incomplete results if the site deliberately blocks automated traversal.
Related: How to ignore robots.txt in wget
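wget honors robots.txt by default during recursive retrieval, but a Disallow rule like the one above can also be enforced explicitly. A sketch, assuming /private is the only restricted path on this host:

$ wget --mirror --convert-links --adjust-extension --page-requisites \
    --no-parent --domains=docs.example.net \
    --exclude-directories=/private \
    https://docs.example.net/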
- Use spider mode first to validate the scope without downloading the site.
$ wget --spider --recursive --level=3 --no-parent \
    --domains=docs.example.net \
    https://docs.example.net/
Spider mode enabled. Check if remote file exists.
--2026-03-29 09:21:15--  https://docs.example.net/
Resolving docs.example.net (docs.example.net)... 203.0.113.50
Connecting to docs.example.net (docs.example.net)|203.0.113.50|:443... connected.
HTTP request sent, awaiting response... 200 OK
Remote file exists and could contain links to other resources -- retrieving.

--2026-03-29 09:21:15--  https://docs.example.net/docs/index.html
HTTP request sent, awaiting response... 200 OK
Remote file exists and could contain links to other resources -- retrieving.

--2026-03-29 09:21:15--  https://docs.example.net/assets/site.css
HTTP request sent, awaiting response... 200 OK
Remote file exists and could contain links to other resources -- retrieving.

Found no broken links.
A dry run is the fastest way to catch a bad start URL or overly broad recursion depth before large transfers begin.
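On a larger site the spider output is easier to review from a log file than from the terminal. A minimal sketch, with spider.log as an assumed log name; the summary line matches the one shown above:

$ wget --spider --recursive --level=3 --no-parent \
    --domains=docs.example.net \
    --output-file=spider.log https://docs.example.net/
$ grep 'broken' spider.log
Found no broken links.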
- Run the full mirror with link conversion, page requisites, and pacing enabled.
$ wget --mirror --convert-links --adjust-extension --page-requisites \
    --no-parent --domains=docs.example.net \
    --wait=1 --random-wait --limit-rate=250k \
    https://docs.example.net/
--2026-03-29 09:21:31--  https://docs.example.net/
Resolving docs.example.net (docs.example.net)... 203.0.113.50
Connecting to docs.example.net (docs.example.net)|203.0.113.50|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 391 [text/html]
Saving to: 'docs.example.net/index.html'

--2026-03-29 09:21:31--  https://docs.example.net/robots.txt
HTTP request sent, awaiting response... 200 OK
Length: 91 [text/plain]
Saving to: 'docs.example.net/robots.txt'

--2026-03-29 09:21:32--  https://docs.example.net/assets/site.css
HTTP request sent, awaiting response... 200 OK
Length: 63 [text/css]
Saving to: 'docs.example.net/assets/site.css'

--2026-03-29 09:21:34--  https://docs.example.net/docs/index.html
HTTP request sent, awaiting response... 200 OK
Length: 240 [text/html]
Saving to: 'docs.example.net/docs/index.html'

--2026-03-29 09:21:35--  https://docs.example.net/docs/overview.html
HTTP request sent, awaiting response... 200 OK
Length: 251 [text/html]
Saving to: 'docs.example.net/docs/overview.html'

FINISHED --2026-03-29 09:21:35--
Total wall clock time: 4.8s
Downloaded: 6 files, 1.2K in 0s (2.62 MB/s)
Converted links in 4 files in 0.001 seconds.
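Because --mirror includes timestamping, rerunning the same command later refreshes the copy in place: wget compares server timestamps and skips files that have not changed. Appending the log to a file keeps a history across runs; mirror.log below is an assumed name:

$ wget --mirror --convert-links --adjust-extension --page-requisites \
    --no-parent --domains=docs.example.net \
    --wait=1 --random-wait --limit-rate=250k \
    --append-output=mirror.log https://docs.example.net/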
Option                    Purpose
--mirror                  Shortcut for --timestamping --recursive --level=inf --no-remove-listing.
--page-requisites         Downloads assets needed to render mirrored HTML pages correctly.
--convert-links           Rewrites eligible internal links for local browsing.
--domains                 Defines the allowed host list for redirects or host-spanning retrievals.
--wait / --random-wait    Adds pacing so the mirror behaves more politely.

- Verify that the mirrored tree contains both HTML pages and dependent assets.
$ find docs.example.net -maxdepth 4 -type f | sort
docs.example.net/assets/site.css
docs.example.net/docs/index.html
docs.example.net/docs/overview.html
docs.example.net/image/logo.svg
docs.example.net/index.html
docs.example.net/robots.txt
A useful mirror needs the page files and the assets they reference, not just the starting HTML document.
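A quick way to check this is to count pages and non-page assets separately; against the six-file listing above, both counts come out to three:

$ find docs.example.net -type f -name '*.html' | wc -l
3
$ find docs.example.net -type f ! -name '*.html' | wc -l
3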
- Confirm that internal links were rewritten for offline use.
$ grep -o 'href="[^"]*"' docs.example.net/index.html | head -n 3
href="assets/site.css"
href="docs/index.html"
href="docs/overview.html"
Converted local paths are the signal that the mirror can be browsed from disk instead of depending on the original website.
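As a final sanity check, any absolute URL to the origin host that still appears in the saved pages usually points at a resource that was not downloaded, since --convert-links rewrites those references to absolute URLs rather than local paths. Assuming GNU grep, no output from the command below means every internal reference was localized:

$ grep -rl 'https://docs.example.net' docs.example.net --include='*.html'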
