A robots.txt file tells compliant crawlers which URL paths on a website they may fetch. Create one when the site has low-value or repetitive areas such as cart flows, internal search results, preview URLs, faceted filters, or other sections that should not consume routine crawler requests.
The file must be served as UTF-8 plain text from the exact root URL for the origin that needs the rules, such as https://www.example.com/robots.txt. Current Google documentation supports User-agent, Allow, Disallow, and Sitemap directives, applies the rules only to the same protocol, host, and port, and treats path values as case-sensitive.
Robots.txt controls crawling, not secrecy or guaranteed removal from search results. A blocked URL can still appear as a bare result if it was discovered elsewhere. Google treats a missing /robots.txt file as no crawl restriction for that host. Blocking shared CSS, JavaScript, or image paths can make crawler rendering less accurate.
Steps to create a robots.txt file for your website:
- List only the URL paths that need crawl control and leave privacy, login protection, redirects, and deindexing to the tools built for those jobs.
Good robots.txt candidates include internal search results, cart and checkout paths, faceted filter URLs, preview paths, and other low-value sections that should not absorb routine crawler requests.
Do not use robots.txt to protect private or staging content, because the file is public and blocked URLs can still be requested directly or appear as bare URLs in search results.
- Write one default crawler group with only the path rules the site actually needs.
    User-agent: *
    Disallow: /search/
    Disallow: /cart/
    Disallow: /checkout/
    Sitemap: https://www.example.com/sitemap.xml
Keep the syntax simple: one directive per line, path values that start with /, and a fully qualified Sitemap: URL when the site publishes an XML sitemap.
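The syntax rules above can be checked mechanically before the file is uploaded. The following is a minimal Python sketch, not a definitive generator: the function name `build_robots_txt` and its parameters are hypothetical, and it only covers a single group of Disallow rules plus an optional Sitemap line.

```python
from urllib.parse import urlsplit


def build_robots_txt(disallow_paths, sitemap_url=None, user_agent="*"):
    """Return robots.txt text with one directive per line (hypothetical helper)."""
    lines = [f"User-agent: {user_agent}"]
    for path in disallow_paths:
        # Path values are expected to start with "/".
        if not path.startswith("/"):
            raise ValueError(f"path must start with '/': {path!r}")
        lines.append(f"Disallow: {path}")
    if sitemap_url is not None:
        parts = urlsplit(sitemap_url)
        # The Sitemap value should be a fully qualified URL.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"sitemap must be an absolute URL: {sitemap_url!r}")
        lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = build_robots_txt(
        ["/search/", "/cart/", "/checkout/"],
        sitemap_url="https://www.example.com/sitemap.xml",
    )
    print(text, end="")
```

Rejecting a relative path or a relative sitemap URL at build time is a design choice for this sketch; a crawler would typically just ignore or misread such a line.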
- Add an Allow: rule only when a smaller crawlable area must stay open inside a broader blocked directory.
    User-agent: *
    Disallow: /private/
    Allow: /private/help-center/
Allow: is useful only for a narrower exception inside a blocked parent path; if there is no exception, skip it.
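To see why the narrower Allow: rule wins, it can help to model the matching. The sketch below assumes the "most specific rule wins" behaviour described in the Robots Exclusion Protocol (the longest matching path prefix decides, and Allow wins a tie). That precedence is an assumption not spelled out in this guide, the function name `is_allowed` is hypothetical, and wildcards (`*`, `$`) are deliberately left out.

```python
def is_allowed(path, rules):
    """Return True if path may be crawled under (directive, prefix) rules.

    Assumed precedence: the longest matching prefix wins; on a tie, Allow wins.
    Matching is case-sensitive and ignores wildcards.
    """
    best_len = -1
    allowed = True  # no matching rule means the path is crawlable
    for directive, prefix in rules:
        if not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


RULES = [("Disallow", "/private/"), ("Allow", "/private/help-center/")]

if __name__ == "__main__":
    for path in ("/private/reports/", "/private/help-center/faq", "/Private/reports/"):
        print(path, is_allowed(path, RULES))
```

The last path in the demo is not blocked, because `/Private/` differs in case from the rule `/private/` and path values are case-sensitive.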
- Publish the file as lowercase /robots.txt at the root of the exact protocol and host that need the policy.
    https://www.example.com/robots.txt
Each important origin needs its own file when the crawl policy differs, so https://example.com/robots.txt does not control https://www.example.com/, http://example.com/, or https://example.com:8443/.
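One way to work out which file governs a given page is to keep the page's scheme, host, and port and replace everything else with /robots.txt. The following Python sketch illustrates that; `robots_url_for` is a hypothetical helper name, not part of any library.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url_for(page_url):
    """Return the robots.txt URL for the origin that serves page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and netloc (host plus optional port); drop path, query, fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    for url in (
        "https://www.example.com/cart/view?id=1",
        "https://example.com/blog/",
        "http://example.com/",
        "https://example.com:8443/admin",
    ):
        print(url, "->", robots_url_for(url))
```

Each of the four demo URLs maps to a different robots.txt URL, which is why a policy published on one origin does not carry over to the others.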
- Fetch the live file directly and confirm that it returns readable plain text from the public URL.
    $ curl -i https://www.example.com/robots.txt
    HTTP/2 200
    content-type: text/plain; charset=utf-8

    User-agent: *
    Disallow: /search/
    Disallow: /cart/
    Disallow: /checkout/
    Sitemap: https://www.example.com/sitemap.xml
A direct fetch exposes wrong filenames such as robots.txt.txt, uploads to the wrong document root, HTML error pages, and redirects that point at another host.
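The same checks can be scripted with Python's standard library. This is a rough sketch under stated assumptions: the function names `check_robots_response` and `fetch_and_check` are hypothetical, the HTML detection is a simple heuristic, and the demo feeds in sample values rather than contacting a real server.

```python
from urllib.error import HTTPError
from urllib.parse import urlsplit
from urllib.request import urlopen


def check_robots_response(expected_url, final_url, status, content_type, body):
    """Return a list of problems found in a fetched robots.txt response."""
    problems = []
    if status != 200:
        problems.append(f"unexpected status {status}")
    expected, final = urlsplit(expected_url), urlsplit(final_url)
    if (expected.scheme, expected.netloc) != (final.scheme, final.netloc):
        problems.append(f"redirected to another origin: {final_url}")
    if not content_type.lower().startswith("text/plain"):
        problems.append(f"content type is {content_type!r}, not text/plain")
    # Heuristic: an HTML error page usually starts with a doctype or <html> tag.
    if body.lstrip().lower().startswith(("<!doctype", "<html")):
        problems.append("body looks like an HTML page")
    return problems


def fetch_and_check(url, timeout=10):
    """Fetch url (urlopen follows redirects) and run the checks above."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            content_type = resp.headers.get("Content-Type", "")
            return check_robots_response(url, resp.geturl(), resp.status, content_type, body)
    except HTTPError as err:
        # urlopen raises for 4xx and 5xx responses instead of returning them.
        return [f"unexpected status {err.code}"]


if __name__ == "__main__":
    # Offline demo with sample values: a redirect to another host serving HTML.
    print(check_robots_response(
        "https://www.example.com/robots.txt",
        "https://example.com/robots.txt",
        200,
        "text/html; charset=utf-8",
        "<!DOCTYPE html><title>Not found</title>",
    ))
```

Calling `fetch_and_check("https://www.example.com/robots.txt")` against a real site returns an empty list when none of these problems are detected.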
- Open the robots.txt report in Google Search Console when the site is tracked as a Domain property or a host-level URL-prefix property without a path.
The report shows the fetched file, parsing issues, fetch history, and a recrawl action for urgent fixes after a broken fetch or a critical rule change.
- Recheck one blocked URL, one allowed public page, and any shared asset directory after publishing so the file is restricting only the intended paths.
Blocking a shared asset directory or publishing Disallow: / on the wrong origin can suppress discovery and weaken crawler rendering until the corrected file is fetched again.
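A spot check like this can be automated with `urllib.robotparser` from Python's standard library. The sketch below is only a rough check: the sample URLs, the `/assets/` path, and the `audit` helper are assumptions for illustration, and the standard-library parser may not resolve every rule combination the way Google's parser does.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
Disallow: /cart/
Disallow: /checkout/
Sitemap: https://www.example.com/sitemap.xml
"""

# Hypothetical sample URLs mapped to whether crawling should be permitted.
EXPECTED = {
    "https://www.example.com/search/?q=shoes": False,     # blocked URL
    "https://www.example.com/products/blue-shoe": True,   # public page
    "https://www.example.com/assets/site.css": True,      # shared asset
}


def audit(robots_text, expectations, user_agent="ExampleBot"):
    """Return {url: actual_result} for every URL that differs from expectations."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    mismatches = {}
    for url, want in expectations.items():
        got = parser.can_fetch(user_agent, url)
        if got != want:
            mismatches[url] = got
    return mismatches


if __name__ == "__main__":
    print(audit(ROBOTS_TXT, EXPECTED))                        # intended policy
    print(audit("User-agent: *\nDisallow: /\n", EXPECTED))    # accidental site-wide block
```

An empty result means every sample URL behaved as intended; the second call shows how an accidental Disallow: / surfaces as mismatches on the public page and the asset.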
Mohd Shakir Zakaria is a cloud architect with deep roots in software development and open-source advocacy. Certified in AWS, Red Hat, VMware, ITIL, and Linux, he specializes in designing and managing robust cloud and on-premises infrastructures.
