An AI crawler is a bot that fetches web pages so a large language model can do something with them. In 2026 that something is one of three jobs: train the next model, populate a search index, or answer a specific user's question live. Each job has its own bot. The gap between "allow everything" and "block everything" is the part most robots.txt guides skip, and it's where the interesting decisions live.

This guide is the full reference: who the crawlers are, what they actually do, how much traffic they generate, and what 10 high-traffic sites have decided — with a live audit done the morning of publication. The canonical companion tool is Lumina's Crawler Access Checker, which runs the same analysis against any URL.

What is an AI crawler?

An AI crawler fetches HTML and sends it downstream to a model that will either memorize it, index it, or quote it. The bot itself looks identical to a classic web crawler — same HTTP requests, same user-agent headers, same respect (usually) for robots.txt. What changes is what happens to the content after it's fetched.

Classic crawlers like Googlebot exist to rank pages against each other and return links in a search result. Users click through. AI crawlers skip the click. ChatGPT reads your page, reasons about the content, and writes an answer that may cite you or may not. That's the shift. Your content stops being a destination and becomes a source.

Because "AI crawler" is a job description rather than a strict category, the list evolves fast. The bots from OpenAI, Anthropic, Google, Perplexity, Apple, Meta, and Common Crawl are the ones that matter in 2026. Vercel's own edge logs (late 2024) show GPTBot, Claude, AppleBot, and PerplexityBot accounting for the overwhelming majority of declared AI-crawler volume across their network. A few dozen smaller crawlers exist, but the bots above are the ones worth writing rules for.

The major AI crawlers in 2026

Every public-facing AI system worth tracking declares a user agent. Here are the ones you'll see in your server logs:

| Bot | Operator | Purpose | Respects robots.txt |
| --- | --- | --- | --- |
| GPTBot | OpenAI | Trains future GPT models on fetched content. | Yes |
| OAI-SearchBot | OpenAI | Builds the index ChatGPT Search queries. | Yes |
| ChatGPT-User | OpenAI | Fetches live when a user asks ChatGPT to browse. | No (since OpenAI's Dec 2025 doc update) |
| ClaudeBot | Anthropic | Trains future Claude models. | Yes |
| Claude-SearchBot | Anthropic | Populates Claude's retrieval index. | Yes |
| Claude-User | Anthropic | Fetches live when a user asks Claude to browse. | Yes |
| PerplexityBot | Perplexity | Indexes pages for Perplexity's search surface. | Yes |
| Perplexity-User | Perplexity | Fetches live when a user asks Perplexity a question. | Yes (with exceptions reported in 2024) |
| Google-Extended | Google | Opt-out signal for Gemini training and Google AI features. | Yes (directive, not a crawler) |
| CCBot | Common Crawl | Public dataset used by many LLMs (Llama, Falcon, older GPT). | Yes |
| AppleBot-Extended | Apple | Opt-out signal for Apple Intelligence training. | Yes (directive) |
| Meta-ExternalAgent | Meta | Training and product integration for Meta AI. | Yes |

Two things worth noting. First, Googlebot itself isn't on this list — Googlebot still fetches your pages for classic search, and Google-Extended is the separate opt-out you set to exclude your content from Gemini training. Blocking Google-Extended doesn't affect your ranking in Google Search. Second, Bingbot isn't on this list either, even though ChatGPT Search partially relies on Bing's index. Microsoft has signaled that blocking Bingbot hurts both traditional search and AI search — the two share infrastructure there.
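If you want to know which of these bots actually hit your site, grep the declared user-agent strings out of your access logs. A minimal sketch in Python; the sample log lines are invented for illustration, and real log formats will vary:

```python
from collections import Counter

# Declared AI-crawler user agents to look for (substring match, case-insensitive)
AI_BOTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
    "Claude-SearchBot", "Claude-User", "PerplexityBot", "Perplexity-User",
    "CCBot", "Meta-ExternalAgent",
]

def count_ai_hits(log_lines):
    """Count requests per declared AI bot across access-log lines."""
    hits = Counter()
    for line in log_lines:
        lowered = line.lower()
        for bot in AI_BOTS:
            if bot.lower() in lowered:
                hits[bot] += 1
                break  # attribute each request to one bot
    return hits

# Invented sample lines standing in for a real access log
sample = [
    '1.2.3.4 - - [20/Apr/2026] "GET /post HTTP/1.1" 200 "Mozilla/5.0 (compatible; GPTBot/1.1)"',
    '5.6.7.8 - - [20/Apr/2026] "GET /post HTTP/1.1" 200 "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '9.9.9.9 - - [20/Apr/2026] "GET / HTTP/1.1" 200 "Mozilla/5.0 (Windows NT 10.0) Chrome/120"',
]
hits = count_ai_hits(sample)
print(hits)
```

Substring matching on user agents is deliberately loose: a spoofed agent is counted the same as a real one, so treat this as a traffic overview, not verification. OpenAI, for example, publishes IP ranges if you need to confirm a GPTBot hit is genuine.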

The single biggest shift from 2024 to 2026: OpenAI, Anthropic, and Perplexity all split their one bot into multiple bots with distinct jobs. This is the decision point most robots.txt guides still gloss over, and it's the one that changes what blocking actually costs you.

OpenAI separated GPTBot from OAI-SearchBot and ChatGPT-User in 2024. Anthropic mirrored the move with Claude-SearchBot and Claude-User. Perplexity runs a two-tier version with PerplexityBot for indexing and Perplexity-User for live retrieval. The pattern is the same everywhere: one bot for training, one for indexing, one for live user-triggered fetches.

Here's why that matters. If you block GPTBot only, you've stopped OpenAI from training on your content — but ChatGPT can still cite you when a user asks it a question, because ChatGPT-User is a different bot. Most publishers want that middle ground: no training, yes citation.

One wrinkle worth knowing. In a December 2025 documentation update, OpenAI removed the robots.txt compliance language from ChatGPT-User and added a note that because those fetches are user-initiated, "robots.txt rules may not apply". Blocking ChatGPT-User in robots.txt no longer works reliably. For training (GPTBot) and search indexing (OAI-SearchBot), robots.txt still does what you'd expect. Anthropic's three bots all still respect robots.txt per their support docs. So the training-vs-retrieval split is still a valid strategy, with a small asterisk on ChatGPT-User specifically.

Concrete example. The Guardian allows OAI-SearchBot and ChatGPT-User through the * fallback while blocking ClaudeBot and CCBot outright. They want to be cited in ChatGPT's live answers but not used in training corpora. The NYT takes the opposite approach: after their 2023 lawsuit against OpenAI, they block every AI bot in existence. Wikipedia takes the third stance — allow everything, because their CC BY-SA license already permits re-use.

How much traffic AI crawlers generate

Cloudflare reported roughly 50 billion AI crawler requests per day across their network in 2025 — about 1% of all web traffic routed through them, and sharply up from 2024. Most of that is training, not answering user queries: Cloudflare's own breakdown shows training-purpose crawls making up nearly 80% of AI bot volume, with GPTBot and ClaudeBot together accounting for about half of all observed AI crawling.

Classic search still dwarfs the AI side. Vercel's edge logs for late 2024 put Googlebot at 4.5 billion requests per month versus GPTBot at 569 million, Claude at 370 million, AppleBot at 314 million, and PerplexityBot at 24 million. That's roughly an 8× gap to GPTBot and nearly 200× to PerplexityBot. AI crawlers are growing fast but haven't caught traditional search in raw volume, and probably won't for years.

Not every crawler follows the rules, either. Per Vercel's own data, the major declared bots — GPTBot, Claude, AppleBot — respect robots.txt. The problem is the undeclared ones. A Wired investigation in June 2024 caught Perplexity bypassing robots.txt via an unnamed AWS-hosted crawler; Cloudflare publicly called out the same pattern in August 2025. If a crawler has decided to scrape you, robots.txt won't stop it. For real enforcement you need the WAF or IP-block layer.

Live audit: what 10 top sites actually block

I pulled robots.txt from 10 high-traffic sites on the morning of publication and parsed each for Allow/Disallow directives across 12 AI bots. The pattern is clear: training bots get blocked roughly three times as often as retrieval bots. Here's what the numbers say.
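The parsing step is reproducible with Python's standard-library robots.txt parser. A sketch of the classification logic, run here against an inline robots.txt in the common middle-ground shape rather than a live fetch (the rules shown are illustrative, not any audited site's actual file):

```python
from urllib.robotparser import RobotFileParser

TRAINING_BOTS = ["GPTBot", "ClaudeBot", "CCBot"]
RETRIEVAL_BOTS = ["OAI-SearchBot", "Claude-SearchBot", "Claude-User"]

# Illustrative robots.txt in the "block training, allow retrieval" shape
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow:
"""

def blocked_bots(robots_text, bots, url="https://example.com/"):
    """Return the bots that may NOT fetch the given URL under robots_text."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return [bot for bot in bots if not parser.can_fetch(bot, url)]

print(blocked_bots(ROBOTS_TXT, TRAINING_BOTS))   # all three training bots blocked
print(blocked_bots(ROBOTS_TXT, RETRIEVAL_BOTS))  # empty: they fall through to *
```

Point the same function at a fetched robots.txt body and a 12-bot list and you have the audit loop; the block rates reported here are just these results aggregated across sites.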

Live Audit · 2026-04-20

10 top sites, 12 AI bots, one clear pattern.

Ran robots.txt parsing against 10 high-traffic sites: nytimes.com, wsj.com, bbc.com, cnn.com, reuters.com, theguardian.com, spiegel.de, zeit.de, wikipedia.org, medium.com. Each site's explicit rules and * fallback classified against 12 major AI bots.

  • 50% block ClaudeBot outright. NYT, WSJ, BBC, Spiegel, Zeit all block the Anthropic training crawler. Only 10% block Claude-SearchBot. The split is working — publishers know the difference.
  • 40% block GPTBot. 4 of 10 block the OpenAI training bot. 7 of 10 let OAI-SearchBot through — explicit training refusal, explicit citation invitation. The exact pattern the 2024 bot split enabled.
  • 10/10 blocked by NYT. The New York Times robots.txt blocks every AI bot we checked, including Google-Extended and Meta-ExternalAgent. Post-lawsuit posture: zero tolerance for AI access, period.
  • 0/10 blocked by Wikipedia. Wikipedia allows every AI bot via its * fallback. Their Creative Commons license already permits re-use, so they have no policy reason to block. The opposite decision to NYT.
  • 50% block CCBot (Common Crawl). Half the audited sites block CCBot — the bot that feeds Llama, Falcon, older GPT, and dozens of academic models. Blocking GPTBot without blocking CCBot is a common gap.
  • 3× training vs. retrieval gap. Across the audit, training-focused bots (GPTBot, ClaudeBot, CCBot) are blocked at 47% — roughly three times the 13% block rate on retrieval-focused bots (OAI-SearchBot, Claude-SearchBot, Claude-User). The split bots work.

Run the same audit on any URL →

The second-order finding: of the 10 sites, 8 use the three-tier approach (block training, allow retrieval, allow user browsing). Only NYT is all-blocks and only Wikipedia is all-allows. The middle-ground strategy is now the mainstream-publisher default.

Should you block AI crawlers?

Not a yes/no question. It depends on three things: how you make money, whether your content is substitutable, and whether AI citation actually drives traffic to you today.

If you're a news publisher behind a paywall, the logic that drove the NYT lawsuit applies. Training bots learn your content and a few months later the model paraphrases your reporting without sending you a reader. Blocking training is defensive. Allowing retrieval is the open question: you show up as a cited source in ChatGPT and Perplexity answers, but the referral traffic is still a tiny fraction of organic Google. Anthropic's own crawl-to-refer ratio (per Cloudflare's August 2025 data) was roughly 2,500:1 on news sites — they crawl a lot and send little back.

If you run a marketing or lead-gen site, the opposite math applies. You want to be mentioned in ChatGPT answers about your category. Blocking AI bots cuts you out of that channel. Allow everything, including training — the downside of training is minimal when your content is a sales pitch, and the upside of citation is real. Lumina itself takes this stance.

If you're an affiliate site, documentation site, or comparison-review site, the retrieval angle dominates. People ask ChatGPT "what's the best X" and the answer cites 3-5 sources. If you're not one of them, you don't exist. Allow OAI-SearchBot, Claude-SearchBot, ChatGPT-User, Claude-User, Perplexity-User, and PerplexityBot. You can block GPTBot and ClaudeBot if you want to push back on training, but the retrieval bots are the ones doing you favors.

Default stance if you're unsure: block GPTBot, ClaudeBot, CCBot, and Google-Extended. Allow everything else. This is the middle position and it mirrors what most mainstream publishers have landed on.

The robots.txt reference for AI crawlers

Three copy-pasteable configurations, from most permissive to most restrictive. Put the file at /robots.txt at your domain root. User-agent matching is case-insensitive. Rules apply to paths, which are case-sensitive.
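Both matching rules are easy to verify with Python's standard-library parser; the lowercase gptbot group and the /private path here are hypothetical examples:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: agent name written lowercase, one path rule
parser = RobotFileParser()
parser.parse("""\
User-agent: gptbot
Disallow: /private
""".splitlines())

# Agent matching is case-insensitive: "GPTBot" matches the lowercase group
print(parser.can_fetch("GPTBot", "https://example.com/private/report"))  # False
# Path matching is case-sensitive: /Private is a different path, so allowed
print(parser.can_fetch("GPTBot", "https://example.com/Private/report"))  # True
```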

1. Allow everything (Wikipedia-style)

User-agent: *
Disallow:

Use this if your content is open-licensed, or you actively want citation and don't mind training. Most B2B SaaS and marketing sites should be here.

2. Block training, allow retrieval (the middle ground)

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: AppleBot-Extended
Disallow: /

User-agent: *
Disallow:

This blocks every training bot we've documented and lets retrieval bots through via the * fallback. OAI-SearchBot, Claude-SearchBot, ChatGPT-User, Claude-User, PerplexityBot, and Perplexity-User all still have access.

3. Block everything AI (the NYT approach)

User-agent: GPTBot
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-SearchBot
Disallow: /
User-agent: Claude-User
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Perplexity-User
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: AppleBot-Extended
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Googlebot
Allow: /

User-agent: *
Allow: /

The explicit Googlebot: Allow is technically redundant: Googlebot isn't named in any Disallow group, so it would fall through to the permissive * fallback anyway. Keep it regardless. It documents intent, and it protects classic Google crawling if someone later tightens the * group. The block is AI-only; you want Google Search to keep crawling.

One rule to remember

robots.txt is a request, not a firewall. Well-behaved crawlers respect it — Vercel's own data shows the major declared bots (GPTBot, Claude, AppleBot) all comply. Scrapers pretending to be Chrome, or crawlers operating under undisclosed user agents, won't. For real enforcement, pair robots.txt with Cloudflare's AI-scraper blocking feature, an IP deny list, or a WAF rule that challenges suspicious traffic.
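What a WAF-layer rule looks like depends on your stack. As one illustration, a minimal nginx sketch that hard-blocks the declared training bots at the server (the bot list is an assumption to adapt; anything spoofing a browser user agent still gets through):

```nginx
# Flag declared AI training-bot user agents (case-insensitive regex match)
map $http_user_agent $ai_training_bot {
    default      0;
    ~*GPTBot     1;
    ~*ClaudeBot  1;
    ~*CCBot      1;
}

server {
    listen 80;
    server_name yourdomain.com;

    # Enforce the block at the server, not just request it via robots.txt
    if ($ai_training_bot) {
        return 403;
    }
}
```

This pairs with robots.txt rather than replacing it: a polite bot that honored your disallow never sees the 403; the 403 is for the ones that didn't.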

Common mistakes

Six patterns I see in client audits and competitor robots.txt inspections:

  • Blocking GPTBot but leaving CCBot open. Common Crawl feeds Llama, Falcon, older GPT, and dozens of academic models. If your goal is "no training use of my content," you need to block both.
  • Blocking ChatGPT-User. This removes you from ChatGPT's live answer citations. If that's the goal, fine. If you only wanted to block training, you blocked the wrong bot.
  • Wildcard blocks catching legitimate bots. A line like User-agent: *bot Disallow: / isn't valid robots.txt syntax and will be ignored. Worse, Disallow: /*bot* can block arbitrary URLs. Use explicit user-agent names.
  • Trusting robots.txt as enforcement. The crawlers you most want to stop are the ones most likely to ignore robots.txt — Perplexity's unnamed AWS crawler, scrapers with faked user agents, undeclared bots. The polite bots respect it. The aggressive ones don't. Layer in WAF or IP blocks if enforcement matters.
  • Forgetting to deploy. robots.txt lives at the domain root. Subfolder CMSes sometimes generate a robots.txt that never gets served. Test with curl https://yourdomain.com/robots.txt — if that returns 404 or the wrong file, nothing you wrote matters.
  • Updating robots.txt without accounting for CDN caching. Some CDNs cache robots.txt for 24 hours. If you update rules at 10am and the cache doesn't clear until midnight, you've shipped nothing for 14 hours. Purge the CDN cache after every robots.txt change.

FAQ

What is an AI crawler?
An AI crawler is a bot that fetches web pages on behalf of a large language model. The LLM uses what it fetched for one of three jobs: training on the content, building a real-time search index, or answering a specific user's question live. Each job has its own dedicated bot in 2026 — GPTBot trains, OAI-SearchBot indexes, ChatGPT-User browses for a user.
Should I block AI crawlers from my website?
It depends on whether you want to be cited in AI answers. Block training bots (GPTBot, ClaudeBot, CCBot, Google-Extended) if you don't want your content used to train future models. Keep search and user-browsing bots (OAI-SearchBot, Claude-SearchBot, ChatGPT-User, Perplexity-User) allowed if you want to show up as a source when someone asks ChatGPT or Perplexity a question. Blocking everything removes you from the AI search index entirely.
Does robots.txt actually stop AI crawlers?
For the major declared bots, yes. Vercel's own edge-log data shows GPTBot, Claude, and AppleBot all respecting robots.txt. The problem is the undeclared crawlers. A Wired investigation in June 2024 caught Perplexity bypassing robots.txt via an unnamed AWS-hosted crawler, and Cloudflare publicly called out the same pattern in August 2025. If a crawler has decided to scrape you, robots.txt won't stop it. For hard enforcement you need the WAF or IP-block layer.
What is the difference between GPTBot and ChatGPT-User?
GPTBot fetches content to train future OpenAI models. ChatGPT-User fetches a single page in response to a specific user asking ChatGPT to browse the web. OpenAI separated them in 2024 so publishers could opt out of training without losing citation in ChatGPT's live-browsing answers. One important 2025 change: GPTBot still respects robots.txt, but OpenAI's December 2025 doc update says ChatGPT-User does not — because the fetch is user-initiated, robots.txt rules "may not apply". Blocking ChatGPT-User via robots.txt no longer works reliably.
Is CCBot an AI crawler?
Indirectly, yes. CCBot is the Common Crawl bot. Common Crawl is a public dataset that many LLMs — including older versions of GPT, Llama, Falcon, and dozens of academic models — have used as training data. Blocking GPTBot without blocking CCBot means your content still ends up in training corpora through the back door. Most sites that block AI training bots also block CCBot.
What does Google-Extended do?
Google-Extended is an opt-out signal, not a separate crawler. Googlebot still fetches your pages for classic search. The Google-Extended directive in robots.txt tells Google not to use your content for training Gemini and Google's AI features. Blocking Google-Extended doesn't affect your ranking in Google Search — it only affects AI training.

Where to start

If you want working AI-crawler rules on your site this week, do these five things in order:

Audit your current robots.txt

Run Lumina's Crawler Access Checker on your domain. It checks 36 bots against your live robots.txt and the actual server response. Free, no signup.

Crawler Access Checker →
Pick your stance

Allow all (marketing, B2B SaaS, documentation), block training only (most news and publishers), or block everything (newsrooms with paywall and legal posture like NYT). Most sites want the middle ground.

Decision framework ↑
Deploy the robots.txt

Copy the config above that matches your stance. Put it at /robots.txt. Verify it with curl. Check CDN caching — Cloudflare and Fastly may hold the old file for hours.

Copy the config ↑
Add an llms.txt if you allow access

If you're allowing retrieval, a clean llms.txt file helps AI crawlers find your canonical pages faster. No documented ranking effect yet, but it's a low-cost signal.

llms.txt Generator →
Track AI referral traffic in GA4

Set up source tracking for chatgpt.com, perplexity.ai, claude.ai, and gemini.google.com. Volume is small in 2026 but the trend tells you whether your retrieval strategy is paying off.

GA4 Dashboard →

Check how AI crawlers see your site

Lumina's free Crawler Access Checker tests your robots.txt against 36 bots — including every major AI crawler. One URL, no signup, real server responses alongside robots.txt rules.

Run the Crawler Access Checker →