How to create a robots.txt file for your website

Search crawlers can spend requests on generated, duplicate, or low-value paths that do not belong in routine discovery. A root-level robots.txt file gives compliant crawlers a public crawl policy for those paths while leaving the rest of the origin available.

Each file applies only to the protocol, host, and port where it is served, so a policy on https://www.example.com/robots.txt does not govern another subdomain or the HTTP version of that host. The filename must be lowercase, the response must use UTF-8 plain text, and path matching is case-sensitive.
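
As a rough illustration of that scoping, the sketch below derives the robots.txt URL that would govern a given page URL. The helper name robots_url and the sample URLs are assumptions made for this example, not part of any crawler's API.

```python
from urllib.parse import urlsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the origin that serves page_url."""
    parts = urlsplit(page_url)
    # netloc keeps the host and any explicit port; the path, query, and
    # fragment are dropped because the file always lives at the root.
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


# Hypothetical URLs: each one maps to a different policy file.
print(robots_url("https://www.example.com/catalog/item?id=7"))
print(robots_url("http://www.example.com/catalog/"))
print(robots_url("https://shop.example.com:8443/cart/"))
```

Each of the three sample URLs resolves to a separate robots.txt, because the protocol, host, or port differs.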

Crawl rules are not access controls or indexing directives. A blocked URL can still appear in search results without a snippet, non-compliant bots can ignore the file, and blocking shared CSS or JavaScript can prevent search crawlers from rendering public pages accurately.

Steps to create a robots.txt file for your website:

Choose the exact origin and URL paths that need crawl restrictions.

Anyone can read robots.txt, so listing private path names, customer details, credentials, or unreleased project names in it exposes them. Sensitive content requires authentication rather than crawler compliance.
Create /var/www/html/robots.txt in the public document root for the selected origin.
```
User-agent: *
Disallow: /search/
Disallow: /cart/
Disallow: /checkout/
Sitemap: https://www.example.com/sitemap.xml
```
/var/www/html/ represents the selected site's public document root and may differ by server configuration. The wildcard group covers every compliant crawler that has no more specific group of its own, and the fully qualified Sitemap: URL may point to a sitemap or sitemap index.
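
One way to sanity-check the draft before publishing is to feed it to Python's standard urllib.robotparser module. This is a minimal sketch: the ExampleBot agent name and the tested paths are assumptions, and the module's matching may differ from what a particular search crawler does.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /search/
Disallow: /cart/
Disallow: /checkout/
Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())  # parse the text directly, no network request


def allowed(path: str, agent: str = "ExampleBot") -> bool:
    """Return True if the hypothetical agent may crawl path on the example origin."""
    return parser.can_fetch(agent, "https://www.example.com" + path)


print(allowed("/search/?q=shoes"))   # falls under Disallow: /search/
print(allowed("/products/shoes/"))   # no rule matches this path
print(parser.site_maps())            # Sitemap URLs found in the file (Python 3.8+)
```

Because ExampleBot has no group of its own, the wildcard group decides both lookups.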
Add an Allow: rule only for a narrower path that must remain crawlable inside a blocked parent.
```
Disallow: /catalog/filter/
Allow: /catalog/filter/help/
```
The longest matching path wins, so this exception leaves /catalog/filter/help/ crawlable while the broader filter path remains blocked.
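
The precedence rule can be sketched in a few lines. This is a simplified illustration, not a full matcher: it treats rule values as plain prefixes, ignores the * and $ wildcards, and assumes that Allow wins a tie in length, which is the behavior RFC 9309 recommends. The function name is_allowed and the test paths are hypothetical. The sketch is hand-written because Python's urllib.robotparser applies rules in file order rather than by length.

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Apply longest-match precedence to (directive, prefix) pairs from one group."""
    best_length = -1
    allowed = True  # a path that no rule matches stays crawlable
    for directive, prefix in rules:
        # str.startswith is case-sensitive, like robots.txt path matching.
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        # A longer prefix is more specific; on equal length, Allow wins.
        if len(prefix) > best_length or (len(prefix) == best_length and is_allow):
            best_length = len(prefix)
            allowed = is_allow
    return allowed


RULES = [
    ("Disallow", "/catalog/filter/"),
    ("Allow", "/catalog/filter/help/"),
]

print(is_allowed("/catalog/filter/help/", RULES))       # True: the longer Allow wins
print(is_allowed("/catalog/filter/color/red/", RULES))  # False: only Disallow matches
print(is_allowed("/catalog/shoes/", RULES))             # True: no rule matches
```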
Review the completed file against the intended crawler groups and path boundaries.

Valid syntax has one directive per line, path values beginning with /, and path letter case that matches the site's URLs. Google ignores unsupported fields such as Crawl-delay.
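
Part of that review can be automated. The sketch below checks only a subset of the points above (a field and value on each line, paths that begin with /, and fields outside the four that Google documents as supported). The name lint_robots and the sample input are assumptions for illustration, and the check does not replace reading the file.

```python
SUPPORTED = {"user-agent", "allow", "disallow", "sitemap"}


def lint_robots(text: str) -> list[str]:
    """Return human-readable problems found in robots.txt text."""
    problems = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments, skip blank lines
        if not line:
            continue
        field, sep, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if not sep:
            problems.append(f"line {number}: no 'field: value' pair")
        elif field not in SUPPORTED:
            problems.append(f"line {number}: '{field}' may be ignored")
        elif field in ("allow", "disallow") and value and not value.startswith("/"):
            problems.append(f"line {number}: path should begin with '/'")
    return problems


# Deliberately broken sample input.
SAMPLE = """\
User-agent: *
Disallow: search/
Crawl-delay: 10
Allow /help/
"""

for problem in lint_robots(SAMPLE):
    print(problem)
```

Letter case still needs a manual comparison against the site's real URLs, since the sketch cannot know which casing the site uses.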
Publish the file as lowercase /robots.txt at the root of the selected origin.

In production, a wildcard group that contains Disallow: / blocks compliant crawlers from every path on that origin.
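
A pre-publish guard against that mistake might look like the following sketch. It uses Python's standard urllib.robotparser and only asks whether the root path is blocked for an agent without its own group. The name blocks_root and the ExampleBot agent are assumptions.

```python
from urllib.robotparser import RobotFileParser


def blocks_root(robots_text: str, agent: str = "ExampleBot") -> bool:
    """Return True if the rules stop the hypothetical agent from crawling '/'."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return not parser.can_fetch(agent, "/")


STAGING_LEFTOVER = "User-agent: *\nDisallow: /\n"
INTENDED = "User-agent: *\nDisallow: /search/\nDisallow: /cart/\n"

print(blocks_root(STAGING_LEFTOVER))  # True: the whole origin is blocked
print(blocks_root(INTENDED))          # False: only the listed paths are blocked
```

Running a check like this in a deployment script could stop a staging-only file from reaching the production origin.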

Confirm the live robots.txt response from the selected origin.

$ curl --fail-with-body --silent --show-error --dump-header - https://www.example.com/robots.txt
HTTP/1.1 200 OK
content-type: text/plain; charset=utf-8
content-length: 116
connection: close

User-agent: *
Disallow: /search/
Disallow: /cart/
Disallow: /checkout/
Sitemap: https://www.example.com/sitemap.xml

A successful root response with the intended plain-text rules proves that the selected origin is publishing the file. A redirect, HTML page, authentication challenge, or HTTP error means crawlers are not receiving this policy as shown.
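
The same judgment can be written as a small check over the values seen in the response. This sketch makes no request itself: the status, Content-Type header, and body are passed in from whatever client fetched the file, and the name check_robots_response is a hypothetical helper.

```python
def check_robots_response(status: int, content_type: str, body: str) -> list[str]:
    """Return reasons why a response would not deliver the policy as written."""
    problems = []
    if 300 <= status < 400:
        problems.append(f"redirect ({status})")
    elif status == 401:
        problems.append("authentication challenge (401)")
    elif status != 200:
        problems.append(f"HTTP error or unexpected status ({status})")
    # Compare only the media type; parameters such as charset are ignored here.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type != "text/plain":
        problems.append(f"not plain text ({media_type or 'no content type'})")
    if "user-agent:" not in body.lower():
        problems.append("no User-agent group in body")
    return problems


GOOD_BODY = "User-agent: *\nDisallow: /search/\n"
print(check_robots_response(200, "text/plain; charset=utf-8", GOOD_BODY))
print(check_robots_response(301, "text/html", "<html>Moved</html>"))
```

An empty list means none of the failure modes named above were detected. It does not prove that the rules themselves are the intended ones.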

Author: Mohd Shakir Zakaria
Mohd Shakir Zakaria is a cloud architect with deep roots in software development and open-source advocacy. Certified in AWS, Red Hat, VMware, ITIL, and Linux, he specializes in designing and managing robust cloud and on-premises infrastructures.