Heise published an obituary for robots.txt in October 2025: "Abschied von robots.txt (1994-2025)" ("Farewell to robots.txt, 1994-2025") — a wistful piece arguing that the protocol that civilized the web is dead because AI crawlers have made compliance optional. The framing is half right. The protocol is indeed no longer the silent default it was for two decades. Compliance is no longer universal. The set of bots fetching the average site has exploded from a handful of search engines to dozens of AI crawlers, agents, and downstream scrapers, each with its own posture toward robots.txt.
But the obituary skips the inconvenient half: the major AI labs have publicly committed to honoring robots.txt for their declared user-agents, and CDN operators including Cloudflare have reported broadly high compliance for the named bots, with documented exceptions (Perplexity was caught ignoring robots.txt in Wired's June 2024 reporting). The protocol still works for the bots that publicly say they respect it. It just doesn't substitute for a firewall when you need real enforcement. This guide is the honest 2026 picture: what robots.txt is, why it still matters, the 19 AI crawler user-agents you actually need to know, the five patterns that work, the six mistakes most sites make, and the audit workflow that catches all of them. A live audit of 10 top-ranking robots.txt guides on Google EN and DE shows the gap: 9 of 10 don't ship FAQPage schema, 3 don't mention AI crawlers at all, and the most-cited US guide hasn't been touched since March 2025.
What Robots.txt Actually Is
Robots.txt is a plain-text file at the root of your domain (yoursite.com/robots.txt) that tells web crawlers which paths they may fetch and which they should leave alone. The file format is standardized in RFC 9309, published by the IETF in September 2022 — the first time the protocol got a formal specification after 28 years of de facto adoption.
The file is fetched by every well-behaved crawler before any other request to your domain. Googlebot, bingbot, GPTBot, ClaudeBot, PerplexityBot — they all start by requesting yoursite.com/robots.txt, parse it, and use the rules to decide which URLs to fetch next. The protocol is voluntary: nothing forces a bot to honor what it reads. But for the bots that publicly commit to respecting robots.txt, the file is the single most efficient way to control what they crawl.
One detail most guides skip: robots.txt controls fetching, not indexing. A page disallowed in robots.txt won't be crawled, but it can still appear in Google's index (with no title and no description) if other sites link to it. To keep a page out of the index, you need a noindex meta tag or HTTP header — which only works if the bot is allowed to fetch the page in the first place. This nuance is the source of mistake number one in the section below.
Why Robots.txt Still Matters in 2026
Robots.txt matters in 2026 for the same reason it mattered in 2010, plus a new one. Crawl budget is real: Googlebot allocates a finite number of fetches per domain per day, and disallowed pages free those fetches for content that needs to be indexed. AI crawlers add a second layer — train opt-out, retrieval opt-in, all through one file.
The classical case is straightforward. Sites with thousands of low-value URLs (faceted navigation, internal search, calendar pages, session-tagged URLs) waste crawl budget if they let Googlebot fetch them. A precise robots.txt redirects that budget to canonical content and improves the overall freshness of the indexed pages. Google's John Mueller has confirmed this on multiple Search Off the Record episodes: crawl prioritization responds to disallow rules, and the response is usually visible within a week.
The new case is AI. ChatGPT, Claude, Perplexity, and Google's Gemini each operate one or more crawlers, and the crawlers split into three jobs: training future models, building a retrieval index for live answers, and acting as a logged-in agent on a user's behalf. Each job uses a distinct user-agent. Robots.txt is the only standardized place where you can tell a training crawler "no" and a retrieval crawler "yes" at the same time, opting out of being LLM training data while staying citable in real-time AI search.
What 10 top-ranking robots.txt guides actually ship
I audited the top 5 EN and top 5 DE results for "robots.txt seo" / "robots.txt" via Playwright, using Lumina's Schema Validator and Meta Tag Analyzer. The "obituary"-vs-"guide" split is the real story.
Few of the audited guides wire their Article schema together via @id; most ship inline blocks or skip the connection entirely, so AI citation engines can't link byline to brand.

Run the same audit on any URL with Lumina's Crawler Access Checker →
The File Format (RFC 9309)
RFC 9309 defines exactly five things in a robots.txt file: User-agent declarations, Allow rules, Disallow rules, the comment character (#), and a few minor formatting rules. Everything else (Crawl-delay, Sitemap, Host) is non-standard but widely supported. The specification is short enough to read in fifteen minutes and answers most edge cases your team will argue about.
Here's a minimal valid robots.txt that demonstrates every directive a typical site needs:
User-agent: *
Disallow: /admin/
Disallow: /search?
Disallow: /tmp/
Allow: /admin/help/
User-agent: Googlebot
Disallow: /staging/
Sitemap: https://example.com/sitemap.xml
The file reads top to bottom. Each User-agent line opens a group, and the rules below it apply to that user-agent until the next User-agent line. A crawler obeys the group whose user-agent token matches it most specifically (Googlebot follows its named group above and ignores the * group entirely), regardless of file order. Wildcards work for path matching: * matches any sequence of characters, $ matches the end of the URL. So Disallow: /*.pdf$ blocks every PDF on the site.
The Allow directive overrides a Disallow within the same group. The order doesn't matter — the longest matching rule wins. So Disallow: /admin/ followed by Allow: /admin/help/ means "block the admin area but let crawlers fetch the help section." Many files put the Allow first out of caution; that doesn't change behavior, but it can mislead readers into thinking order decides.
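The longest-match rule is easy to verify mechanically. Below is a minimal sketch of the matching logic for a single user-agent group; is_allowed is a hypothetical helper, not a full RFC 9309 parser (no wildcard or percent-encoding handling):

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """rules: list of ("allow" | "disallow", path_prefix) for one UA group."""
    best_len, allowed = -1, True  # no matching rule -> allowed by default
    for kind, prefix in rules:
        if path.startswith(prefix):
            # Longer match wins; on a length tie, Allow beats Disallow.
            if len(prefix) > best_len or (len(prefix) == best_len and kind == "allow"):
                best_len, allowed = len(prefix), (kind == "allow")
    return allowed

group = [("disallow", "/admin/"), ("allow", "/admin/help/")]
print(is_allowed("/admin/settings", group))   # False -> blocked by /admin/
print(is_allowed("/admin/help/faq", group))   # True  -> longer /admin/help/ wins
```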
Three things that look like part of the spec but aren't: Crawl-delay (Bing and Yandex respect it; Google ignores it; the value is seconds between requests), Host (Yandex uses it for canonical domain selection; nobody else does), and the noindex robots.txt directive that some old guides reference (Google removed support in September 2019 — never reliable, never coming back).
The AI Crawler Layer: 19 New User-Agents
The AI crawler population grew from zero to nineteen between 2022 and 2026. Each operator has its own user-agents, often split by purpose. The split that matters most is training versus retrieval: training crawlers fetch your content to teach future models; retrieval crawlers fetch live during user queries and provide the citations.
| Operator | User-agent | Purpose |
|---|---|---|
| OpenAI | GPTBot | Training crawler for future GPT models |
| OpenAI | ChatGPT-User | On-demand fetch when a ChatGPT user asks a question |
| OpenAI | OAI-SearchBot | Indexing crawler for ChatGPT Search results |
| Anthropic | ClaudeBot | Training crawler for future Claude models |
| Anthropic | Claude-User | On-demand fetch when a Claude user invokes web search |
| Anthropic | Claude-SearchBot | Indexing crawler for Claude's web search |
| Anthropic | anthropic-ai | Legacy training user-agent (still active) |
| Perplexity | PerplexityBot | Indexing crawler for Perplexity's search index |
| Perplexity | Perplexity-User | On-demand fetch when a Perplexity user asks a question |
| Google | Google-Extended | Training opt-out for Gemini and Vertex AI |
| Google | Google-Agent | Project Mariner agent acting on behalf of a logged-in user |
| Apple | Applebot-Extended | Training opt-out for Apple Intelligence |
| Common Crawl | CCBot | Builds the public dataset used by many open-source LLMs |
| ByteDance | Bytespider | Training crawler for TikTok / Doubao models |
| Meta | Meta-ExternalAgent | Crawler for Meta AI features |
| Mistral | MistralAI-User | On-demand fetch from Le Chat (Mistral's UI) |
| DeepSeek | DeepSeekBot | Training crawler for DeepSeek models |
| xAI | xAI-Web-Crawler | Crawler for Grok |
| Cohere | cohere-ai | Training crawler for Cohere models |
The training-vs-retrieval distinction is the actionable one. If you want to opt out of being LLM training data but stay citable in AI search, block the training crawlers and allow the retrieval ones. Concrete example for OpenAI: a User-agent: GPTBot group with Disallow: / blocks training, while leaving OAI-SearchBot and ChatGPT-User unblocked keeps you citable in ChatGPT Search. Anthropic works the same way: block ClaudeBot and anthropic-ai for training opt-out, allow Claude-SearchBot and Claude-User for live citation. This is the configuration most publishers actually want, and it's the configuration most existing robots.txt files don't have because they pre-date the user-agent split.
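In file form, the OpenAI half of that policy looks like this (the explicit Allow groups are technically redundant when no * rule blocks those bots, but they make the intent visible to future maintainers):

```
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /
```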
The 5 Patterns That Work
Five robots.txt configurations cover roughly 90% of real-world sites. Each solves a different problem; most sites need a combination of two or three. The pattern you start with depends on whether your priority is crawl budget, AI training opt-out, AI retrieval visibility, or content security (with the caveat that robots.txt is the wrong tool for security).
1. The minimal everything-allowed file
For a site with no specific blocking needs, two lines do the job:
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
This explicitly says "all crawlers welcome, full site allowed." The Sitemap line points crawlers to your sitemap.xml. Nothing else is needed. Many small sites overcomplicate their robots.txt; this is the right baseline.
2. Crawl budget protection
For sites with faceted navigation, internal search, calendars, or session-tagged URLs, target the high-noise paths:
User-agent: *
Disallow: /search?
Disallow: /*?session=
Disallow: /*?utm_
Disallow: /tag/
Disallow: /author/
Sitemap: https://example.com/sitemap.xml
This keeps Googlebot focused on canonical content. Target the parameter explosions, and never block the actual product, article, or service pages.
3. AI training opt-out, retrieval allowed
The 2026 default for most publishers — opt out of being training data, stay citable in AI search:
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: Bytespider
Disallow: /
# Retrieval bots stay allowed — citations in AI answers welcome
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
4. Full AI block
For brand or legal reasons (paywalled news, copyrighted IP, regulated content), block all AI crawlers including retrieval:
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: OAI-SearchBot
User-agent: ClaudeBot
User-agent: Claude-User
User-agent: Claude-SearchBot
User-agent: anthropic-ai
User-agent: PerplexityBot
User-agent: Perplexity-User
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: CCBot
User-agent: Bytespider
User-agent: Meta-ExternalAgent
User-agent: MistralAI-User
User-agent: DeepSeekBot
User-agent: xAI-Web-Crawler
User-agent: cohere-ai
Disallow: /
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
Note that this only stops the named bots. Common Crawl's CCBot scrapes for many downstream open-source models, so blocking CCBot indirectly affects models you've never heard of. For hard enforcement, layer this with a Cloudflare WAF rule or a server-side block.
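As a sketch of that layering, a Cloudflare custom rule with a block action can match declared AI user-agents at the edge. The expression below uses Cloudflare's Rules language; the bot list is an example, so extend it to match your policy:

```
(http.user_agent contains "GPTBot") or
(http.user_agent contains "ClaudeBot") or
(http.user_agent contains "Bytespider")
```

Unlike robots.txt, this refuses the request outright, but it still only catches bots that declare themselves in the user-agent string.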
5. Section-specific rules
For sites where some sections should block AI but others should allow it (e.g., a marketing site that allows AI on the blog but blocks it on the docs):
User-agent: GPTBot
Disallow: /docs/
Allow: /
User-agent: ClaudeBot
Disallow: /docs/
Allow: /
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
The longest-matching-rule logic means Disallow on /docs/ wins over Allow on /, so the docs section is blocked for the named bots while the rest of the site stays open.
The 6 Most Common Mistakes
Six recurring mistakes account for the bulk of robots.txt issues I see in audits. None of them are subtle, all of them are common, and each one costs traffic, crawl budget, or AI visibility. Most production robots.txt files I audit ship at least two of the six. The fixes below take an afternoon to apply.
1. Disallowing a page you want to deindex
The most common mistake on the entire web. To remove a page from Google, the meta noindex tag is the right tool — but if the page is also disallowed in robots.txt, Googlebot won't fetch the page, won't see the noindex tag, and may keep the URL in the index based on inbound links. The correct sequence: leave the page crawlable, add noindex, wait for Google to re-crawl and drop it, then optionally add the disallow back if you want to block future crawls.
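For reference, the tag Googlebot needs to see during that window is the standard robots meta tag in the page's head:

```html
<meta name="robots" content="noindex">
```

For non-HTML files, the X-Robots-Tag header (see the comparison table below) does the same job.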
2. Blocking CSS or JavaScript
Older robots.txt files often block /wp-content/, /assets/, or /js/ for "speed" reasons. This breaks Google's ability to render the page properly. Googlebot needs CSS and JS to evaluate mobile-friendliness, layout shift, and content visibility. Modern Googlebot rendering depends on full asset access. Allow CSS and JS unless you have a very specific reason to block them, and never block the path that contains your CMS asset bundles.
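If a legacy block can't be removed outright, a carve-out that relies on longest-match precedence is a common compromise. A sketch with WordPress-style example paths; the longer Allow rules win over the shorter Disallow for matching assets:

```
User-agent: *
Disallow: /wp-content/
Allow: /wp-content/*.css$
Allow: /wp-content/*.js$
```

The cleaner fix is still to drop the Disallow entirely unless something sensitive actually lives there.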
3. Using robots.txt as a security layer
Robots.txt is publicly readable. Listing /admin/, /staging/, or /backup/ in your disallow rules tells anyone curious enough to fetch yoursite.com/robots.txt exactly which paths exist and which paths you'd rather they didn't visit. Use HTTP authentication, IP allowlists, or VPN-only access for anything you actually want protected. Reserve robots.txt for paths you simply don't want crawled — not paths you don't want discovered.
4. Wildcards in user-agent names
RFC 9309 allows wildcards in path matching but not in user-agent names. Writing User-agent: GPT* doesn't match GPTBot — it matches a literal user-agent string of "GPT*", which no bot uses. Each AI crawler needs to be listed by its exact name. Yes, that means up to 19 User-agent lines if you want to block them all (see pattern 4 above).
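The wrong and right versions side by side, using GPTBot as the example:

```
# Wrong: matches only the literal token "GPT*", which no bot sends
User-agent: GPT*
Disallow: /

# Right: the exact user-agent token
User-agent: GPTBot
Disallow: /
```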
5. Conflicting rules across multiple files
A robots.txt file lives at exactly one location: the root of the domain (yoursite.com/robots.txt). If you have a CMS that auto-generates one and a static file in your repo, you may end up with two competing files, and the server will serve whichever one wins in your routing chain (typically a static file shadows the CMS route, but the order depends on your setup). Audit method: fetch yoursite.com/robots.txt yourself and confirm what you actually serve matches what you intend.
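That audit is easy to script. A minimal sketch, assuming a hypothetical public/robots.txt path for the repo copy:

```python
# Drift check: does the live robots.txt match the version in the repo?
import urllib.request

live = urllib.request.urlopen("https://yoursite.com/robots.txt").read().decode("utf-8")
with open("public/robots.txt") as f:  # hypothetical repo path; adjust to your project
    repo = f.read()

print("in sync" if live.strip() == repo.strip() else "DRIFT: live file differs from repo")
```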
6. Forgetting the Sitemap directive
The Sitemap directive in robots.txt is the most reliable way to tell every crawler where to find your sitemap. Most CMSes don't add it automatically, so you have to. One line at the bottom: Sitemap: https://yoursite.com/sitemap.xml. Add additional Sitemap lines if you have multiple sitemaps (one per language, one per content type). This costs nothing and helps every crawler discover your full URL set.
Robots.txt vs. Noindex vs. WAF: Which Tool When
Robots.txt, the noindex meta tag, and a server-side WAF block solve different problems, and combining them in the wrong order produces predictable failures. The right tool depends on whether you want to control fetching, indexing, or access itself.
| Tool | What it does | When to use |
|---|---|---|
| robots.txt Disallow | Tells well-behaved bots not to fetch the URL. Doesn't prevent indexing if other sites link to the URL. | Crawl budget control. Faceted nav, search results, calendar pages, session URLs. Path-level rules. |
| noindex meta tag | Tells indexers to fetch the URL but not include it in search results. Per-page directive. | Pages you want kept out of the index (thank-you pages, low-value templates, draft content). Page-level fine control. |
| X-Robots-Tag header | Same as noindex but applies via HTTP header to non-HTML files (PDFs, images). Server-level. | De-indexing PDFs, downloads, or other binary files where a meta tag isn't possible. |
| WAF / firewall block | Refuses the bot's request at the network layer. Hard enforcement. | Bots that ignore robots.txt. Scrapers without declared user-agents. Paywalled content. Aggressive AI training opt-out. |
| HTTP authentication | Requires a valid credential before serving any content. Network + application layer. | Genuinely private content. Staging environments. Internal tools. |
The most common combination error: robots.txt Disallow + noindex on the same URL. The disallow prevents the noindex from being seen, so the URL can still appear in Google's index without a snippet. If you want a URL out of the index, leave it crawlable, add noindex, and only consider disallow once Google has re-crawled and dropped it (typically a few weeks).
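For the X-Robots-Tag row above: the header travels with the HTTP response, so it works for files that can't carry a meta tag. A de-indexed PDF response looks like this on the wire:

```http
HTTP/1.1 200 OK
Content-Type: application/pdf
X-Robots-Tag: noindex
```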
How to Audit Your Robots.txt (3 Methods)
Three audit methods catch the bulk of robots.txt issues. Each takes under fifteen minutes for a typical site, and running all three on a critical domain gives you cross-validation. Most production robots.txt files I audit have at least one issue from this list.
Method 1: Lumina Crawler Access Checker
Paste any URL into Lumina's Crawler Access Checker and the tool fetches your robots.txt, parses every rule, and checks the access status for 36 distinct crawlers (19 AI plus 17 classical search bots). For each bot you get a clear allowed-or-blocked verdict for the page you submitted, plus the matching rule that produced the verdict. Use it to verify that your training opt-out actually blocks the bots you think it blocks and that your retrieval allowlist actually allows the bots you want citing you.
Method 2: Google Search Console robots.txt report
In Search Console, go to Settings → Crawling → robots.txt. The report shows the version of your robots.txt that Google last fetched, any parsing errors, and a tester that lets you check whether Googlebot can fetch a specific URL under the current rules. Use it as the source-of-truth for what Google actually sees, especially after a deploy when caching can lag for up to 24 hours.
Method 3: Direct fetch + manual review
Open yoursite.com/robots.txt in a browser. Confirm: the file exists, returns HTTP 200, has a Sitemap line at the bottom, lists user-agents you actually want to block, and has no surprises (CMS-injected paths, leftover staging rules, or duplicate User-agent groups). Run this manual check at least quarterly — robots.txt is exactly the kind of file that gets quietly modified by plugins or deploy scripts and never reviewed.
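If you want to script the spot-check, Python's standard library ships a robots.txt parser. A minimal sketch with placeholder URLs; note that urllib.robotparser implements the older first-match semantics rather than RFC 9309's longest-match rule, so double-check files that rely on Allow overrides with the Search Console tester:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://yoursite.com/robots.txt")  # swap in your domain
rp.read()  # fetches and parses the live file

# Spot-check the verdicts you care about.
for ua in ("Googlebot", "GPTBot", "OAI-SearchBot", "ClaudeBot"):
    for url in ("https://yoursite.com/", "https://yoursite.com/docs/intro"):
        verdict = "allowed" if rp.can_fetch(ua, url) else "blocked"
        print(f"{ua:15} {url}: {verdict}")
```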
AI Crawlers and the New Compliance Reality
Compliance with robots.txt by AI crawlers is voluntary, like every robots.txt rule, but the major operators have publicly committed to honor it for their declared user-agents. OpenAI documented GPTBot's compliance in August 2023; Anthropic, Perplexity, Google, and Apple followed with similar statements. Cloudflare's 2024 reports indicated broadly high compliance for the named bots, with documented exceptions (Perplexity was caught ignoring robots.txt by Wired in June 2024).
The gap is what happens when content gets fetched by a bot that doesn't respect robots.txt at all. Common Crawl's CCBot is technically compliant — it honors disallow rules — but the public dataset it produces is consumed by dozens of downstream AI models, some of which never identify themselves on your site. Blocking CCBot is the closest robots.txt comes to a "block training generally" option, but even that doesn't catch every downstream use.
The deeper compliance question: even when an AI lab publicly commits to robots.txt, the commitment binds the named user-agent. A research team within the same company can run an unnamed crawler and the official robots.txt commitment doesn't apply to it. This is why the publishers who want hard enforcement layer Cloudflare's AI Bot Block (which blocks at the network level) on top of their robots.txt rules. Robots.txt is a polite request; the WAF is the enforcement.
For most sites, the polite request is enough. The major AI labs have business reasons to honor it (regulatory pressure, brand risk, PR), and the data shows they do. But "enough for most sites" is not "enough for all sites." Publishers with high-value paywalled content, regulated industries, and brand-sensitive properties should run robots.txt + WAF together rather than relying on robots.txt alone.
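You can also measure compliance on your own traffic by scanning the access log for declared AI user-agents fetching paths your robots.txt disallows. A rough sketch, assuming a combined log format and a hand-maintained list of disallowed prefixes:

```python
import re

# Prefixes your robots.txt disallows for these bots (adjust to your file).
DISALLOWED = ["/docs/", "/admin/"]
AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Bytespider", "CCBot"]

# Combined log format: ... "GET /path HTTP/1.1" status size "referer" "user-agent"
LINE = re.compile(r'"[A-Z]+ (?P<path>\S+) HTTP/[^"]*" \d+ \S+ "[^"]*" "(?P<ua>[^"]*)"')

with open("access.log") as f:  # hypothetical log path
    for line in f:
        m = LINE.search(line)
        if not m:
            continue
        path, ua = m.group("path"), m.group("ua")
        if any(bot in ua for bot in AI_BOTS) and any(path.startswith(p) for p in DISALLOWED):
            print(f"possible violation: {ua} fetched {path}")
```

Anything this flags deserves verification before you accuse anyone: confirm the requester really is the named bot (several operators publish IP ranges for exactly this purpose) rather than a scraper spoofing the user-agent.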
A 5-Step Robots.txt Workflow for 2026
Five steps will give you a clean, modern robots.txt setup in a focused afternoon. The first three are diagnostic; the last two are configuration. Most sites complete the whole workflow in two or three hours, including time for a real audit and a careful policy decision before any rules ship.
1. Audit the current state

Fetch yoursite.com/robots.txt directly. Run it through Lumina's Crawler Access Checker. Note which AI crawlers you currently block, which you allow, whether the Sitemap line is present, and whether any rules look stale.

Run Crawler Access Checker →

2. Decide your AI policy

Three choices: full allow, training opt-out + retrieval allow, or full AI block. Most publishers want option two. Document the choice with one sentence in your team docs so the file's intent is obvious to future maintainers.

See the 5 patterns →

3. Find wasted crawl budget

Open Search Console crawl stats. Look for high-frequency fetches on parameter URLs, internal search, calendar pages, and session-tagged URLs. Each one is a candidate for a Disallow rule that frees budget for your real content.

Verify with Sitemap Analyzer →

4. Build and test the file

Build it from the patterns above. Test syntax with Google Search Console's robots.txt tester. Keep it short — 30 to 60 lines is normal for most sites. Avoid the temptation to copy a 500-line robots.txt from a top site you admire.

Re-test before deploy →

5. Deploy and re-verify

Deploy the file. Wait 24 hours for caches to clear. Re-run Lumina's Crawler Access Checker on a sample of pages to confirm the new rules apply as intended. Set a quarterly recurring reminder to re-audit — robots.txt is exactly the kind of file that quietly drifts.
Verify after deploy →
Where to Start
If you can do exactly one thing this week, fetch yoursite.com/robots.txt and run it through Lumina's Crawler Access Checker. Most sites I audit have at least one of three issues: a stale rule still blocking a section that should be open, a missing AI crawler block, or a missing Sitemap directive. Fixing those three things takes an afternoon.
If you have more time, work through the 5-step workflow above. The biggest gain for most publishers in 2026 is the training opt-out plus retrieval allow pattern. It tells the AI labs you've thought about your data, opts you out of training datasets that you don't benefit from, and keeps you citable in real-time AI answers — which is where AI-driven traffic actually lives. Both updates ship in the same file, the deploy is one commit, and the effect is visible in Crawler Access Checker results within minutes.
Audit your robots.txt now
Lumina's Crawler Access Checker tests 36 distinct crawlers (19 AI + 17 classical) against any URL — no signup, free.
Run Crawler Access Checker →