A robots.txt file tells crawlers which URL paths on a website they may request and which sections should stay out of routine crawl traffic. For a webmaster, the file is mainly useful for low-value or duplicate areas such as carts, checkout flows, internal search results, preview URLs, and other paths that should not keep drawing crawler attention away from the site's main public pages.
The file is a plain-text policy served from the site root at https://www.example.com/robots.txt, and crawlers read it before requesting other URLs on that host. Current Google documentation still treats robots.txt as crawl control rather than indexing control, with the practical rule set centered on the User-agent, Allow, Disallow, and Sitemap directives.
The file is easy to overuse. A blocked URL can still appear in search if it was already known, a noindex rule only works when the crawler can still fetch the page that carries it, and blocking shared CSS, JavaScript, or image paths can interfere with rendering checks and page understanding. The safest pattern is a narrow, readable file that is fetched and rechecked after every meaningful change.
Robots.txt is appropriate for crawl management on areas such as internal search, cart, checkout, preview, or duplicate parameter paths, while noindex, canonical tags, redirects, and authentication solve different problems.
Do not use robots.txt as privacy or staging protection, because the file is public and blocked URLs can still be requested directly or remain visible in search if they were already discovered.
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Sitemap: https://www.example.com/sitemap.xml
Keep the syntax simple: one directive per line, comments only when they add real ownership context, and a fully qualified Sitemap: URL when the site has an XML sitemap.
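After editing, a quick sanity check with Python's standard-library urllib.robotparser confirms the file blocks what it should and nothing else. This is a sketch using the example file and hypothetical URLs from this article:

```python
from urllib.robotparser import RobotFileParser

# The simple policy file shown above, as served text.
robots_txt = """\
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Blocked paths should report False, everything else True.
print(parser.can_fetch("*", "https://www.example.com/cart/item-42"))    # False
print(parser.can_fetch("*", "https://www.example.com/products/shoes"))  # True
print(parser.site_maps())  # ['https://www.example.com/sitemap.xml']
```

Parsing the text directly, rather than fetching it, lets the check run before the file is deployed.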
User-agent: *
Disallow: /private/
Allow: /private/help-center/
Google's current documented syntax support is built around User-agent, Allow, Disallow, and Sitemap, so avoid depending on non-standard directives as the main control path.
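Google's documented conflict rule is that the most specific (longest) matching path wins, with the less restrictive Allow rule winning a length tie. Since parsing tools differ in how they resolve Allow/Disallow conflicts, that rule is worth modeling explicitly; here is a minimal sketch for plain path prefixes, ignoring wildcards, with hypothetical rule data:

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """rules is a list of ("allow" | "disallow", path_prefix) pairs.

    Longest matching prefix wins; on a length tie, "allow" wins.
    """
    best_len = -1
    allowed = True  # no matching rule means the path is crawlable
    for kind, prefix in rules:
        if path.startswith(prefix):
            if len(prefix) > best_len or (len(prefix) == best_len and kind == "allow"):
                best_len = len(prefix)
                allowed = (kind == "allow")
    return allowed

# The example file above: /private/ is blocked except its help center.
rules = [("disallow", "/private/"), ("allow", "/private/help-center/")]

print(is_allowed("/private/reports/q3", rules))         # False
print(is_allowed("/private/help-center/login", rules))  # True
print(is_allowed("/blog/post", rules))                  # True
```

The help-center URL is allowed because its Allow prefix is longer than the competing Disallow prefix, which is exactly the behavior the two-line exception in the file relies on.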
Serve robots.txt at the exact host root that needs the rule set, and create separate root-level files for important subdomains when their crawl policy differs. For the main site, that address is https://www.example.com/robots.txt.
A file below the root, such as https://www.example.com/files/robots.txt, is not the site's crawl policy file, and managed platforms may expose this setting through their SEO or search controls instead of a filesystem upload.
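The host-root rule can be made concrete with a small helper that derives the governing robots.txt URL from any page URL, using the standard-library urllib.parse; the URLs are the hypothetical ones from this article:

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the root robots.txt URL that governs crawling for page_url."""
    parts = urlsplit(page_url)
    # Only the scheme and host matter; any deeper path is discarded,
    # because a file below the root is not the crawl policy file.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://www.example.com/files/anything.html"))
# https://www.example.com/robots.txt
print(robots_url("https://shop.example.com/cart/"))
# https://shop.example.com/robots.txt  (a subdomain needs its own file)
```

Note that the second result points at the subdomain's own root, which is why a separate file per subdomain is needed when policies differ.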
$ curl -sS https://www.example.com/robots.txt
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Sitemap: https://www.example.com/sitemap.xml
A direct fetch quickly exposes wrong filenames such as robots.txt.txt, uploads to the wrong document root, stale cache behavior, and CMS routes that return an HTML page instead of the text file.
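The HTML-instead-of-text failure can be caught automatically with a small heuristic on the fetched body and Content-Type header. This is a sketch that works on sample strings, with no network access; the function name and the heuristic itself are illustrative, not a standard check:

```python
def looks_like_robots_txt(body: str, content_type: str = "text/plain") -> bool:
    """Heuristic sketch: flag responses that are probably not a real robots file."""
    if not content_type.lower().startswith("text/plain"):
        return False  # robots.txt should be served as plain text
    stripped = body.lstrip().lower()
    if stripped.startswith("<!doctype") or stripped.startswith("<html"):
        return False  # a CMS route returned an HTML page instead
    return True

print(looks_like_robots_txt("User-agent: *\nDisallow: /cart/\n"))            # True
print(looks_like_robots_txt("<!DOCTYPE html><html>...</html>", "text/html")) # False
```

Wiring this into a deploy check makes the curl-style verification above repeatable instead of manual.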
Google Search Console's robots.txt report still shows the fetched file and allows a request for a faster recrawl when the normal cache refresh window is too slow for the change that was just published.
Useful spot checks include a revenue or lead page that should stay crawlable, a blocked path such as /checkout/, and any shared CSS, JavaScript, or image directory that the live page needs for normal rendering.
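Those spot checks can be expressed as a small list of expectations and run against the live file with urllib.robotparser. A sketch, again using this article's example file and hypothetical page URLs:

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /cart/
Disallow: /checkout/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# (url, should_be_crawlable) pairs for the pages this site cares about.
spot_checks = [
    ("https://www.example.com/pricing", True),           # revenue page stays crawlable
    ("https://www.example.com/checkout/step-1", False),  # blocked path stays blocked
    ("https://www.example.com/assets/site.css", True),   # shared assets stay fetchable
]

for url, expected in spot_checks:
    actual = parser.can_fetch("*", url)
    status = "ok" if actual == expected else "WRONG"
    print(f"{status}: {url} crawlable={actual}")
```

Running this after every robots.txt change turns the manual checklist into a regression test for the crawl policy.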
Blocking a shared asset directory or the whole site with Disallow: / can stop new discovery and distort rendering analysis until crawlers fetch the corrected file again.
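The severity of a blanket Disallow: / is easy to demonstrate with the same parser, since the single-character prefix matches every path on the host:

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse(["User-agent: *", "Disallow: /"])

# Every path on the host is now blocked, including the homepage.
print(parser.can_fetch("*", "https://www.example.com/"))          # False
print(parser.can_fetch("*", "https://www.example.com/new-page"))  # False
```

A check like this in a deploy pipeline can refuse to publish a file that blocks the homepage, which catches the most damaging class of robots.txt mistake before crawlers see it.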