Heise published an obituary for robots.txt in October 2025: "Abschied von robots.txt (1994-2025)" ("Farewell to robots.txt, 1994-2025") — a wistful piece arguing that the protocol that civilized the web is dead because AI crawlers have made compliance optional. The framing is half right. The protocol is indeed no longer the silent default it was for two decades. Compliance is no longer universal. The set of bots fetching the average site has exploded from a handful of search engines to dozens of AI crawlers, agents, and downstream scrapers, each with its own posture toward robots.txt.
But the obituary skips the inconvenient half: the major AI labs have publicly committed to honoring robots.txt for their declared user-agents, and CDN operators including Cloudflare have reported broadly high compliance for the named bots, with documented exceptions (Perplexity was caught ignoring robots.txt in Wired's June 2024 reporting). The protocol still works for the bots that publicly say they respect it. It just doesn't substitute for a firewall when you need real enforcement. This guide is the honest 2026 picture: what robots.txt is, why it still matters, the 19 AI crawler user-agents you actually need to know, the five patterns that work, the six mistakes most sites make, and the audit workflow that catches all of them. A live audit of 10 top-ranking robots.txt guides on Google EN and DE shows the gap: 9 of 10 don't ship FAQPage schema, 3 don't mention AI crawlers at all, and the most-cited US guide hasn't been touched since March 2025.
What Robots.txt Actually Is
Robots.txt is a plain-text file at the root of your domain (yoursite.com/robots.txt) that tells web crawlers which paths they may fetch and which they should leave alone. The file format is standardized in RFC 9309, published by the IETF in September 2022 — the first time the protocol got a formal specification after 28 years of de facto adoption.
The file is fetched by every well-behaved crawler before any other request to your domain. Googlebot, bingbot, GPTBot, ClaudeBot, PerplexityBot — they all start by requesting yoursite.com/robots.txt, parse it, and use the rules to decide which URLs to fetch next. The protocol is voluntary: nothing forces a bot to honor what it reads. But for the bots that publicly commit to respecting robots.txt, the file is the single most efficient way to control what they crawl.
One detail most guides skip: robots.txt controls fetching, not indexing. A page disallowed in robots.txt won't be crawled, but it can still appear in Google's index (with no title and no description) if other sites link to it. To keep a page out of the index, you need a noindex meta tag or HTTP header — which only works if the bot is allowed to fetch the page in the first place. This nuance is the source of mistake number one in the section below.
Why Robots.txt Still Matters in 2026
Robots.txt matters in 2026 for the same reason it mattered in 2010, plus a new one. Crawl budget is real: Googlebot allocates a finite number of fetches per domain per day, and disallowed pages free those fetches for content that needs to be indexed. AI crawlers add a second layer — train opt-out, retrieval opt-in, all through one file.
The classical case is straightforward. Sites with thousands of low-value URLs (faceted navigation, internal search, calendar pages, session-tagged URLs) waste crawl budget if they let Googlebot fetch them. A precise robots.txt redirects that budget to canonical content and improves the overall freshness of the indexed pages. Google's John Mueller has confirmed this on multiple Search Off the Record episodes: crawl prioritization responds to disallow rules, and the response is usually visible within a week.
The new case is AI. ChatGPT, Claude, Perplexity, and Google's Gemini each operate one or more crawlers, and the crawlers split into three jobs: training future models, building a retrieval index for live answers, and acting as a logged-in agent on a user's behalf. Each job uses a distinct user-agent. Robots.txt is the only standardized place where you can tell a training crawler "no" and a retrieval crawler "yes" at the same time, opting out of being LLM training data while staying citable in real-time AI search.
What 10 top-ranking robots.txt guides actually ship
I audited the top 5 EN and top 5 DE results for "robots.txt seo" / "robots.txt" via Playwright, using Lumina's Schema Validator and Meta Tag Analyzer. The "obituary"-vs-"guide" split is the real story.
Few of the audited guides wire their Article schema together via @id; most ship inline blocks or skip the connection entirely, so AI citation engines can't link byline to brand.

Run the same audit on any URL with Lumina's Crawler Access Checker →
The File Format (RFC 9309)
RFC 9309 defines exactly five things in a robots.txt file: User-agent declarations, Allow rules, Disallow rules, the comment character (#), and a few minor formatting rules. Everything else (Crawl-delay, Sitemap, Host) is non-standard but widely supported. The specification is short enough to read in fifteen minutes and answers most edge cases your team will argue about.
Here's a minimal valid robots.txt that demonstrates every directive a typical site needs:
User-agent: *
Disallow: /admin/
Disallow: /search?
Disallow: /tmp/
Allow: /admin/help/
User-agent: Googlebot
Disallow: /staging/
Sitemap: https://example.com/sitemap.xml
The file reads top to bottom. Each User-agent line opens a group, and the rules below it apply to that user-agent until the next User-agent line. A crawler obeys the group whose user-agent token matches it most specifically (Googlebot follows its named group above and ignores the * group entirely), regardless of file order. Wildcards work for path matching: * matches any sequence of characters, $ matches the end of the URL. So Disallow: /*.pdf$ blocks every PDF on the site.
The Allow directive overrides a Disallow within the same group. The order doesn't matter — the longest matching rule wins. So Disallow: /admin/ followed by Allow: /admin/help/ means "block the admin area but let crawlers fetch the help section." Many files put the Allow first out of caution; that doesn't change behavior, but it can mislead readers into thinking order decides.
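The longest-match rule is easy to verify mechanically. Below is a minimal sketch of the matching logic for a single user-agent group; is_allowed is a hypothetical helper, not a full RFC 9309 parser (no wildcard or percent-encoding handling):

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """rules: list of ("allow" | "disallow", path_prefix) for one UA group."""
    best_len, allowed = -1, True  # no matching rule -> allowed by default
    for kind, prefix in rules:
        if path.startswith(prefix):
            # Longer match wins; on a length tie, Allow beats Disallow.
            if len(prefix) > best_len or (len(prefix) == best_len and kind == "allow"):
                best_len, allowed = len(prefix), (kind == "allow")
    return allowed

group = [("disallow", "/admin/"), ("allow", "/admin/help/")]
print(is_allowed("/admin/settings", group))   # False -> blocked by /admin/
print(is_allowed("/admin/help/faq", group))   # True  -> longer /admin/help/ wins
```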
Three things that look like part of the spec but aren't: Crawl-delay (Bing and Yandex respect it; Google ignores it; the value is seconds between requests), Host (Yandex uses it for canonical domain selection; nobody else does), and the noindex robots.txt directive that some old guides reference (Google removed support in September 2019 — never reliable, never coming back).
The AI Crawler Layer: 19 New User-Agents
The AI crawler population grew from zero to nineteen between 2022 and 2026. Each operator has its own user-agents, often split by purpose. The split that matters most is training versus retrieval: training crawlers fetch your content to teach future models; retrieval crawlers fetch live during user queries and provide the citations.
| Operator | User-agent | Purpose |
|---|---|---|
| OpenAI | GPTBot | Training crawler for future GPT models |
| OpenAI | ChatGPT-User | On-demand fetch when a ChatGPT user asks a question |
| OpenAI | OAI-SearchBot | Indexing crawler for ChatGPT Search results |
| Anthropic | ClaudeBot | Training crawler for future Claude models |
| Anthropic | Claude-User | On-demand fetch when a Claude user invokes web search |
| Anthropic | Claude-SearchBot | Indexing crawler for Claude's web search |
| Anthropic | anthropic-ai | Legacy training user-agent (still active) |
| Perplexity | PerplexityBot | Indexing crawler for Perplexity's search index |
| Perplexity | Perplexity-User | On-demand fetch when a Perplexity user asks a question |
| Google | Google-Extended | Training opt-out for Gemini and Vertex AI |
| Google | Google-Agent | Project Mariner agent acting on behalf of a logged-in user |
| Apple | Applebot-Extended | Training opt-out for Apple Intelligence |
| Common Crawl | CCBot | Builds the public dataset used by many open-source LLMs |
| ByteDance | Bytespider | Training crawler for TikTok / Doubao models |
| Meta | Meta-ExternalAgent | Crawler for Meta AI features |
| Mistral | MistralAI-User | On-demand fetch from Le Chat (Mistral's UI) |
| DeepSeek | DeepSeekBot | Training crawler for DeepSeek models |
| xAI | xAI-Web-Crawler | Crawler for Grok |
| Cohere | cohere-ai | Training crawler for Cohere models |
The training-vs-retrieval distinction is the actionable one. If you want to opt out of being LLM training data but stay citable in AI search, block the training crawlers and allow the retrieval ones. Concrete example for OpenAI: a User-agent: GPTBot group with Disallow: / blocks training, while leaving OAI-SearchBot and ChatGPT-User unblocked keeps you citable in ChatGPT Search. Anthropic works the same way: block ClaudeBot and anthropic-ai for training opt-out, allow Claude-SearchBot and Claude-User for live citation. This is the configuration most publishers actually want, and it's the configuration most existing robots.txt files don't have because they pre-date the user-agent split.
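In file form, the OpenAI half of that policy looks like this (the explicit Allow groups are technically redundant when no * rule blocks those bots, but they make the intent visible to future maintainers):

```
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /
```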
The 5 Patterns That Work
Five robots.txt configurations cover roughly 90% of real-world sites. Each solves a different problem; most sites need a combination of two or three. The pattern you start with depends on whether your priority is crawl budget, AI training opt-out, AI retrieval visibility, or content security (with the caveat that robots.txt is the wrong tool for security).
1. The minimal everything-allowed file
For a site with no specific blocking needs, two lines do the job:
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
This explicitly says "all crawlers welcome, full site allowed." The Sitemap line points crawlers to your sitemap.xml. Nothing else is needed. Many small sites overcomplicate their robots.txt; this is the right baseline.
2. Crawl budget protection
For sites with faceted navigation, internal search, calendars, or session-tagged URLs, target the high-noise paths:
User-agent: *
Disallow: /search?
Disallow: /*?session=
Disallow: /*?utm_
Disallow: /tag/
Disallow: /author/
Sitemap: https://example.com/sitemap.xml
This keeps Googlebot focused on canonical content. Target the parameter explosions, and never block the actual product, article, or service pages.
3. AI training opt-out, retrieval allowed
The 2026 default for most publishers — opt out of being training data, stay citable in AI search:
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: Bytespider
Disallow: /
# Retrieval bots stay allowed — citations in AI answers welcome
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
4. Full AI block
For brand or legal reasons (paywalled news, copyrighted IP, regulated content), block all AI crawlers including retrieval:
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: OAI-SearchBot
User-agent: ClaudeBot
User-agent: Claude-User
User-agent: Claude-SearchBot
User-agent: anthropic-ai
User-agent: PerplexityBot
User-agent: Perplexity-User
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: CCBot
User-agent: Bytespider
User-agent: Meta-ExternalAgent
User-agent: MistralAI-User
User-agent: DeepSeekBot
User-agent: xAI-Web-Crawler
User-agent: cohere-ai
Disallow: /
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
Note that this only stops the named bots. Common Crawl's CCBot scrapes for many downstream open-source models, so blocking CCBot indirectly affects models you've never heard of. For hard enforcement, layer this with a Cloudflare WAF rule or a server-side block.
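As a sketch of that layering, a Cloudflare custom rule with a block action can match declared AI user-agents at the edge. The expression below uses Cloudflare's Rules language; the bot list is an example, so extend it to match your policy:

```
(http.user_agent contains "GPTBot") or
(http.user_agent contains "ClaudeBot") or
(http.user_agent contains "Bytespider")
```

Unlike robots.txt, this refuses the request outright, but it still only catches bots that declare themselves in the user-agent string.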
5. Section-specific rules
For sites where some sections should block AI but others should allow it (e.g., a marketing site that allows AI on the blog but blocks it on the docs):
User-agent: GPTBot
Disallow: /docs/
Allow: /
User-agent: ClaudeBot
Disallow: /docs/
Allow: /
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
The longest-matching-rule logic means Disallow on /docs/ wins over Allow on /, so the docs section is blocked for the named bots while the rest of the site stays open.
The 6 Most Common Mistakes
Six recurring mistakes account for the bulk of robots.txt issues I see in audits. None of them are subtle, all of them are common, and each one costs traffic, crawl budget, or AI visibility. Most production robots.txt files I audit ship at least two of the six. The fixes below take an afternoon to apply.
1. Disallowing a page you want to deindex
The most common mistake on the entire web. To remove a page from Google, the meta noindex tag is the right tool — but if the page is also disallowed in robots.txt, Googlebot won't fetch the page, won't see the noindex tag, and may keep the URL in the index based on inbound links. The correct sequence: leave the page crawlable, add noindex, wait for Google to re-crawl and drop it, then optionally add the disallow back if you want to block future crawls.
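For reference, the tag Googlebot needs to see during that window is the standard robots meta tag in the page's head:

```html
<meta name="robots" content="noindex">
```

For non-HTML files, the X-Robots-Tag header (see the comparison table below) does the same job.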
2. Blocking CSS or JavaScript
Older robots.txt files often block /wp-content/, /assets/, or /js/ for "speed" reasons. This breaks Google's ability to render the page properly. Googlebot needs CSS and JS to evaluate mobile-friendliness, layout shift, and content visibility. Modern Googlebot rendering depends on full asset access. Allow CSS and JS unless you have a very specific reason to block them, and never block the path that contains your CMS asset bundles.
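If a legacy block can't be removed outright, a carve-out that relies on longest-match precedence is a common compromise. A sketch with WordPress-style example paths; the longer Allow rules win over the shorter Disallow for matching assets:

```
User-agent: *
Disallow: /wp-content/
Allow: /wp-content/*.css$
Allow: /wp-content/*.js$
```

The cleaner fix is still to drop the Disallow entirely unless something sensitive actually lives there.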
3. Using robots.txt as a security layer
Robots.txt is publicly readable. Listing /admin/, /staging/, or /backup/ in your disallow rules tells anyone curious enough to fetch yoursite.com/robots.txt exactly which paths exist and which paths you'd rather they didn't visit. Use HTTP authentication, IP allowlists, or VPN-only access for anything you actually want protected. Reserve robots.txt for paths you simply don't want crawled — not paths you don't want discovered.
4. Wildcards in user-agent names
RFC 9309 allows wildcards in path matching but not in user-agent names. Writing User-agent: GPT* doesn't match GPTBot — it matches a literal user-agent string of "GPT*", which no bot uses. Each AI crawler needs to be listed by its exact name. Yes, that means up to 19 User-agent lines if you want to block them all (see pattern 4 above).
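The wrong and right versions side by side, using GPTBot as the example:

```
# Wrong: matches only the literal token "GPT*", which no bot sends
User-agent: GPT*
Disallow: /

# Right: the exact user-agent token
User-agent: GPTBot
Disallow: /
```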
5. Conflicting rules across multiple files
A robots.txt file lives at exactly one location: the root of the domain (yoursite.com/robots.txt). If you have a CMS that auto-generates one and a static file in your repo, you may end up with two competing files, and the server will serve whichever one wins in your routing chain (typically a static file shadows the CMS route, but the order depends on your setup). Audit method: fetch yoursite.com/robots.txt yourself and confirm what you actually serve matches what you intend.
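That audit is easy to script. A minimal sketch, assuming a hypothetical public/robots.txt path for the repo copy:

```python
# Drift check: does the live robots.txt match the version in the repo?
import urllib.request

live = urllib.request.urlopen("https://yoursite.com/robots.txt").read().decode("utf-8")
with open("public/robots.txt") as f:  # hypothetical repo path; adjust to your project
    repo = f.read()

print("in sync" if live.strip() == repo.strip() else "DRIFT: live file differs from repo")
```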
6. Forgetting the Sitemap directive
The Sitemap directive in robots.txt is the most reliable way to tell every crawler where to find your sitemap. Most CMSes don't add it automatically, so you have to. One line at the bottom: Sitemap: https://yoursite.com/sitemap.xml. Add additional Sitemap lines if you have multiple sitemaps (one per language, one per content type). This costs nothing and helps every crawler discover your full URL set.
Robots.txt vs. Noindex vs. WAF: Which Tool When
Robots.txt, the noindex meta tag, and a server-side WAF block solve different problems, and combining them in the wrong order produces predictable failures. The right tool depends on whether you want to control fetching, indexing, or access itself.
| Tool | What it does | When to use |
|---|---|---|
| robots.txt Disallow | Tells well-behaved bots not to fetch the URL. Doesn't prevent indexing if other sites link to the URL. | Crawl budget control. Faceted nav, search results, calendar pages, session URLs. Path-level rules. |
| noindex meta tag | Tells indexers to fetch the URL but not include it in search results. Per-page directive. | Pages you want kept out of the index (thank-you pages, low-value templates, draft content). Page-level fine control. |
| X-Robots-Tag header | Same as noindex but applies via HTTP header to non-HTML files (PDFs, images). Server-level. | De-indexing PDFs, downloads, or other binary files where a meta tag isn't possible. |
| WAF / firewall block | Refuses the bot's request at the network layer. Hard enforcement. | Bots that ignore robots.txt. Scrapers without declared user-agents. Paywalled content. Aggressive AI training opt-out. |
| HTTP authentication | Requires a valid credential before serving any content. Network + application layer. | Genuinely private content. Staging environments. Internal tools. |
The most common combination error: robots.txt Disallow + noindex on the same URL. The disallow prevents the noindex from being seen, so the URL can still appear in Google's index without a snippet. If you want a URL out of the index, leave it crawlable, add noindex, and only consider disallow once Google has re-crawled and dropped it (typically a few weeks).
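For the X-Robots-Tag row above: the header travels with the HTTP response, so it works for files that can't carry a meta tag. A de-indexed PDF response looks like this on the wire:

```http
HTTP/1.1 200 OK
Content-Type: application/pdf
X-Robots-Tag: noindex
```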
How to Audit Your Robots.txt (3 Methods)
Three audit methods catch the bulk of robots.txt issues. Each takes under fifteen minutes for a typical site, and running all three on a critical domain gives you cross-validation. Most production robots.txt files I audit have at least one issue from this list.
Method 1: Lumina Crawler Access Checker
Paste any URL into Lumina's Crawler Access Checker and the tool fetches your robots.txt, parses every rule, and checks the access status for 36 distinct crawlers (19 AI plus 17 classical search bots). For each bot you get a clear allowed-or-blocked verdict for the page you submitted, plus the matching rule that produced the verdict. Use it to verify that your training opt-out actually blocks the bots you think it blocks and that your retrieval allowlist actually allows the bots you want citing you.
Method 2: Google Search Console robots.txt report
In Search Console, go to Settings → Crawling → robots.txt. The report shows the version of your robots.txt that Google last fetched, any parsing errors, and a tester that lets you check whether Googlebot can fetch a specific URL under the current rules. Use it as the source-of-truth for what Google actually sees, especially after a deploy when caching can lag for up to 24 hours.
Method 3: Direct fetch + manual review
Open yoursite.com/robots.txt in a browser. Confirm: the file exists, returns HTTP 200, has a Sitemap line at the bottom, lists user-agents you actually want to block, and has no surprises (CMS-injected paths, leftover staging rules, or duplicate User-agent groups). Run this manual check at least quarterly — robots.txt is exactly the kind of file that gets quietly modified by plugins or deploy scripts and never reviewed.
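If you want to script the spot-check, Python's standard library ships a robots.txt parser. A minimal sketch with placeholder URLs; note that urllib.robotparser implements the older first-match semantics rather than RFC 9309's longest-match rule, so double-check files that rely on Allow overrides with the Search Console tester:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://yoursite.com/robots.txt")  # swap in your domain
rp.read()  # fetches and parses the live file

# Spot-check the verdicts you care about.
for ua in ("Googlebot", "GPTBot", "OAI-SearchBot", "ClaudeBot"):
    for url in ("https://yoursite.com/", "https://yoursite.com/docs/intro"):
        verdict = "allowed" if rp.can_fetch(ua, url) else "blocked"
        print(f"{ua:15} {url}: {verdict}")
```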
AI Crawlers and the New Compliance Reality
Compliance with robots.txt by AI crawlers is voluntary, like every robots.txt rule, but the major operators have publicly committed to honor it for their declared user-agents. OpenAI documented GPTBot's compliance in August 2023; Anthropic, Perplexity, Google, and Apple followed with similar statements. Cloudflare's 2024 reports indicated broadly high compliance for the named bots, with documented exceptions (Perplexity was caught ignoring robots.txt by Wired in June 2024).
The gap is what happens when content gets fetched by a bot that doesn't respect robots.txt at all. Common Crawl's CCBot is technically compliant — it honors disallow rules — but the public dataset it produces is consumed by dozens of downstream AI models, some of which never identify themselves on your site. Blocking CCBot is the closest robots.txt comes to a "block training generally" option, but even that doesn't catch every downstream use.
The deeper compliance question: even when an AI lab publicly commits to robots.txt, the commitment binds the named user-agent. A research team within the same company can run an unnamed crawler and the official robots.txt commitment doesn't apply to it. This is why the publishers who want hard enforcement layer Cloudflare's AI Bot Block (which blocks at the network level) on top of their robots.txt rules. Robots.txt is a polite request; the WAF is the enforcement.
For most sites, the polite request is enough. The major AI labs have business reasons to honor it (regulatory pressure, brand risk, PR), and the data shows they do. But "enough for most sites" is not "enough for all sites." Publishers with high-value paywalled content, regulated industries, and brand-sensitive properties should run robots.txt + WAF together rather than relying on robots.txt alone.
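You can also measure compliance on your own traffic by scanning the access log for declared AI user-agents fetching paths your robots.txt disallows. A rough sketch, assuming a combined log format and a hand-maintained list of disallowed prefixes:

```python
import re

# Prefixes your robots.txt disallows for these bots (adjust to your file).
DISALLOWED = ["/docs/", "/admin/"]
AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Bytespider", "CCBot"]

# Combined log format: ... "GET /path HTTP/1.1" status size "referer" "user-agent"
LINE = re.compile(r'"[A-Z]+ (?P<path>\S+) HTTP/[^"]*" \d+ \S+ "[^"]*" "(?P<ua>[^"]*)"')

with open("access.log") as f:  # hypothetical log path
    for line in f:
        m = LINE.search(line)
        if not m:
            continue
        path, ua = m.group("path"), m.group("ua")
        if any(bot in ua for bot in AI_BOTS) and any(path.startswith(p) for p in DISALLOWED):
            print(f"possible violation: {ua} fetched {path}")
```

Anything this flags deserves verification before you accuse anyone: confirm the requester really is the named bot (several operators publish IP ranges for exactly this purpose) rather than a scraper spoofing the user-agent.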
A 5-Step Robots.txt Workflow for 2026
Five steps will give you a clean, modern robots.txt setup in a focused afternoon. The first three are diagnostic; the last two are configuration. Most sites complete the whole workflow in two or three hours, including time for a real audit and a careful policy decision before any rules ship.
1. Audit the current state

Fetch yoursite.com/robots.txt directly. Run it through Lumina's Crawler Access Checker. Note which AI crawlers you currently block, which you allow, whether the Sitemap line is present, and whether any rules look stale.

Run Crawler Access Checker →

2. Decide your AI policy

Three choices: full allow, training opt-out + retrieval allow, or full AI block. Most publishers want option two. Document the choice with one sentence in your team docs so the file's intent is obvious to future maintainers.

See the 5 patterns →

3. Find wasted crawl budget

Open Search Console crawl stats. Look for high-frequency fetches on parameter URLs, internal search, calendar pages, and session-tagged URLs. Each one is a candidate for a Disallow rule that frees budget for your real content.

Verify with Sitemap Analyzer →

4. Build and test the file

Build it from the patterns above. Test syntax with Google Search Console's robots.txt tester. Keep it short — 30 to 60 lines is normal for most sites. Avoid the temptation to copy a 500-line robots.txt from a top site you admire.

Re-test before deploy →

5. Deploy and re-verify

Deploy the file. Wait 24 hours for caches to clear. Re-run Lumina's Crawler Access Checker on a sample of pages to confirm the new rules apply as intended. Set a quarterly recurring reminder to re-audit — robots.txt is exactly the kind of file that quietly drifts.
Verify after deploy →
Where to Start
If you can do exactly one thing this week, fetch yoursite.com/robots.txt and run it through Lumina's Crawler Access Checker. Most sites I audit have at least one of three issues: a stale rule still blocking a section that should be open, a missing AI crawler block, or a missing Sitemap directive. Fixing those three things takes an afternoon.
If you have more time, work through the 5-step workflow above. The biggest gain for most publishers in 2026 is the training opt-out plus retrieval allow pattern. It tells the AI labs you've thought about your data, opts you out of training datasets that you don't benefit from, and keeps you citable in real-time AI answers — which is where AI-driven traffic actually lives. Both updates ship in the same file, the deploy is one commit, and the effect is visible in Crawler Access Checker results within minutes.
Audit your robots.txt now
Lumina's Crawler Access Checker tests 36 distinct crawlers (19 AI + 17 classical) against any URL — no signup, free.
Run Crawler Access Checker →