How to use robots.txt for your website

A robots.txt file tells crawlers which URL paths on a website they may request and which should stay out of routine crawl traffic. For a webmaster, the file is mainly useful for low-value or duplicate areas such as carts, checkout flows, internal search results, preview URLs, and other paths that should not keep drawing crawler attention away from the site's main public pages.

The file is a plain-text policy served from the site root at

https://www.example.com/robots.txt

, and compliant crawlers read it before requesting other URLs on that host. Current Google documentation treats robots.txt as crawl control rather than indexing control, with the practical rule set centered on the User-agent, Allow, Disallow, and Sitemap directives.

The file is easy to overuse. A blocked URL can still appear in search if it was already known, a noindex rule only works when the crawler can still fetch the page that carries it, and blocking shared CSS, JavaScript, or image paths can interfere with rendering checks and page understanding. The safest pattern is a narrow, readable file that is fetched and rechecked after every meaningful change.
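For day-to-day checks, Python's standard-library robotparser module can evaluate a rule set against specific URLs. This is a minimal sketch; the host and paths are illustrative, not from a real site.

```python
from urllib import robotparser

# A minimal sketch: parse a small rule set and test URLs against it.
# The host and paths are illustrative, not from a real site.
rp = robotparser.RobotFileParser()
rp.modified()  # record a load time so can_fetch treats the rules as read
rp.parse([
    "User-agent: *",
    "Disallow: /cart/",
    "Disallow: /checkout/",
])

print(rp.can_fetch("*", "https://www.example.com/products/widget"))  # True
print(rp.can_fetch("*", "https://www.example.com/cart/"))            # False
```

The same parser can fetch a live file with set_url() and read(), which is closer to what a crawler actually does.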

Steps to use robots.txt for a website:

  1. List the URL patterns that need crawl control and separate them from pages that actually need indexing, canonical, redirect, or access-control decisions.

    Robots.txt is appropriate for crawl management on areas such as internal search, cart, checkout, preview, or duplicate parameter paths, while noindex, canonical tags, redirects, or authentication solve different jobs.

    Do not use robots.txt as privacy or staging protection, because the file is public and blocked URLs can still be requested directly or remain visible in search if they were already discovered.

  2. Create a UTF-8 plain-text file named robots.txt with only the crawler groups and path rules that the site actually needs.
    User-agent: *
    Disallow: /cart/
    Disallow: /checkout/
    Sitemap: https://www.example.com/sitemap.xml

    Keep the syntax simple: one directive per line, comments only when they add real ownership context, and a fully qualified Sitemap: URL when the site has an XML sitemap.
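If the file is generated rather than hand-edited, a short script can enforce the UTF-8, one-directive-per-line shape. This sketch mirrors the example rule set above; the filename and rules are the ones from this step.

```python
# A minimal sketch: write the example rule set as a UTF-8 plain-text
# file, one directive per line, with a trailing newline.
rules = [
    "User-agent: *",
    "Disallow: /cart/",
    "Disallow: /checkout/",
    "Sitemap: https://www.example.com/sitemap.xml",
]

with open("robots.txt", "w", encoding="utf-8", newline="\n") as f:
    f.write("\n".join(rules) + "\n")
```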

  3. Add Allow: only when a broader blocked path contains a smaller area that should stay crawlable.
    User-agent: *
    Disallow: /private/
    Allow: /private/help-center/

    Google's current documented syntax support is built around User-agent, Allow, Disallow, and Sitemap, so avoid depending on non-standard directives as the main control path.
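The Allow-within-Disallow pattern can be sanity-checked locally with the standard-library parser. One caveat for this sketch: Google documents that the longest matching path wins, while Python's parser applies the first matching line, so the Allow line is listed first here to get the same answer under both interpretations.

```python
from urllib import robotparser

# A minimal sketch of the Allow-within-Disallow pattern from this step.
# The Allow line comes first because Python's standard-library parser
# applies the first matching rule, whereas Google uses the longest
# match; this ordering gives the same result either way.
rp = robotparser.RobotFileParser()
rp.modified()  # record a load time so can_fetch treats the rules as read
rp.parse([
    "User-agent: *",
    "Allow: /private/help-center/",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "https://www.example.com/private/help-center/faq"))  # True
print(rp.can_fetch("*", "https://www.example.com/private/reports/"))         # False
```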

  4. Publish the file as
    robots.txt

    at the exact host root that needs the rule set and create separate root-level files for important subdomains when their crawl policy differs.

    https://www.example.com/robots.txt

    A file below the root, such as

    https://www.example.com/files/robots.txt

    , is not the site's crawl policy file, and managed platforms may expose this setting through their SEO or search controls instead of a filesystem upload.

  5. Fetch the public file directly and confirm that it returns plain text from the expected URL without redirects, HTML, login prompts, or a 404 response.
    $ curl -sS https://www.example.com/robots.txt
    User-agent: *
    Disallow: /cart/
    Disallow: /checkout/
    Sitemap: https://www.example.com/sitemap.xml

    A direct fetch quickly exposes wrong filenames such as

    robots.txt.txt

    , uploads to the wrong document root, stale cache behavior, and CMS routes that return an HTML page instead of the text file.
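The fetched body can also be sanity-checked in code. This sketch flags HTML responses and unexpected directive names; the whitelist reflects the four directives covered in this article, and the sample bodies are illustrative.

```python
# A minimal sketch: sanity-check a fetched robots.txt body.
# The whitelist covers the four directives discussed in this article;
# real files may legitimately use others.
KNOWN_FIELDS = {"user-agent", "allow", "disallow", "sitemap"}

def check_robots_body(body):
    """Return a list of problems found in a robots.txt body."""
    if body.lstrip().lower().startswith(("<!doctype", "<html")):
        return ["response looks like HTML, not a plain-text robots.txt"]
    problems = []
    for n, raw in enumerate(body.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        field = line.split(":", 1)[0].strip().lower()
        if field not in KNOWN_FIELDS:
            problems.append("line %d: unrecognized directive %r" % (n, field))
    return problems

print(check_robots_body("User-agent: *\nDisallow: /cart/\n"))    # []
print(check_robots_body("<html><body>Not Found</body></html>"))  # one problem
```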

  6. Open the site's robots.txt report in Google Search Console after a change so the fetched file, syntax issues, and recrawl timing can be reviewed in one place.

    The report shows the most recently fetched copy of the file and lets you request a faster recrawl when the normal cache refresh window is too slow for the change that was just published.

  7. Recheck one important public page and one intentionally blocked path after publishing so the new rule set blocks only the intended areas.

    Useful spot checks include a revenue or lead page that should stay crawlable, a blocked path such as

    /checkout/

    , and any shared CSS, JavaScript, or image directory that the live page needs for normal rendering.
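These spot checks can be scripted against the published rule set. This sketch uses hypothetical pricing and asset paths on the example host; substitute the site's own must-stay-crawlable URLs.

```python
from urllib import robotparser

# A minimal sketch of the spot checks above; the pricing and asset
# paths are hypothetical examples, not required site structure.
rp = robotparser.RobotFileParser()
rp.modified()  # record a load time so can_fetch treats the rules as read
rp.parse([
    "User-agent: *",
    "Disallow: /cart/",
    "Disallow: /checkout/",
])

checks = {
    "https://www.example.com/pricing/": True,             # revenue page stays crawlable
    "https://www.example.com/checkout/": False,           # intentionally blocked
    "https://www.example.com/assets/css/site.css": True,  # shared asset stays crawlable
}
for url, expected in checks.items():
    assert rp.can_fetch("*", url) == expected, url
print("all spot checks passed")
```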

    Blocking a shared asset directory or the whole site with

    Disallow: /

    can stop new discovery and distort rendering analysis until crawlers fetch the corrected file again.