For 50 of the largest media sites in Germany, Austria and Switzerland I checked two things at once: what the robots.txt says (the official policy) and what the server actually delivers when an AI bot shows up with its real user-agent. Two layers, two different signals. Some outlets say the same thing at both layers. Others contradict themselves. All via one tool: Lumina's Crawler Access Checker, which tests 36 bots at once — 19 of them AI crawlers (ChatGPT, Claude, Perplexity, Gemini, Mistral, DeepSeek etc.), 9 classical search engines, and 8 others (social & misc).

The result is a big contradiction: Every site lets Google in. Half block AI.

Here's the data — plus a few findings that surprised me.

Two layers, two signals

robots.txt is the policy layer: a text file in which the outlet declares "bot X may crawl me, bot Y may not". Many AI companies respect it, but only voluntarily; a bot that ignores it is not technically stopped.

Server response is the enforcement layer: what the web server actually returns when a bot GETs with its real user-agent. 200 = content comes through, 403 = hard-blocked (Cloudflare, WAF or CDN rule), 402 = Payment Required (a real HTTP paywall against AI — rare).

Most outlets rely purely on robots.txt. A few (Capital.de, kurier.at) go further and enforce at the server. This study measures both.
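The policy layer can be checked with nothing but Python's standard library. A minimal sketch, using a made-up robots.txt in the style this study found at many German sites (the sample rules and the example.com URL are assumptions for illustration, not any outlet's actual file):

```python
from urllib import robotparser

# Hypothetical robots.txt: the big AI trainers are refused outright,
# everyone else (including Googlebot) only loses an internal path.
SAMPLE_ROBOTS = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /intern/
"""

rp = robotparser.RobotFileParser()
rp.parse(SAMPLE_ROBOTS.splitlines())

article = "https://example.com/politik/article-123"
for bot in ("Googlebot", "GPTBot", "ClaudeBot", "CCBot"):
    print(bot, rp.can_fetch(bot, article))
# Googlebot is allowed; the three AI trainers are not.
```

This only reads the policy. Whether the server actually enforces anything when the bot shows up is a separate measurement on the enforcement layer.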

The core numbers

Of 50 media outlets, how many block each bot via robots.txt (the policy layer — outlets that officially refuse the bot):

Google gets in everywhere, even at the most AI-skeptical sites. GPTBot is shut out by every second site before it even sees the content.

Blocking rate per bot (50 DACH media sites), as % of sites blocking the bot in robots.txt:

  GPTBot: 52%
  ClaudeBot: 52%
  anthropic-ai: 52%
  CCBot: 50%
  cohere-ai: 48%
  Applebot-Extended: 46%
  Bytespider: 46%
  Google-Extended: 42%
  Meta-ExternalAgent: 40%
  PerplexityBot: 36%
  ChatGPT-User: 34%
  MistralAI-User: 24%
  Googlebot: 0%

Who blocks the most

The top 5 AI-hostile media outlets (out of 19 AI bot categories):

  1. Tagesschau — 16/19 AI bots blocked
  2. NDR — 16/19
  3. WDR — 16/19
  4. BR — 16/19
  5. NZZ — 14/19

Four of the top five are German ARD public broadcasters. And they all use the exact same robots.txt — clearly an ARD-wide policy.

The ARD vs ZDF paradox

This is where it gets really interesting. Public broadcasters, same industry, same funding model. And yet:

ZDF lets Applebot-Extended, GPTBot, ClaudeBot, Perplexity, Google-Extended and everything else crawl freely. ARD shuts the same bots out completely. Both collect license fees from the same country. Over the next two years, this difference decides whose content gets cited in ChatGPT and Gemini — and whose does not.

Germany twice as aggressive as Austria and Switzerland

Average AI bots blocked per site:

German media are noticeably more defensive. This probably ties back to Axel Springer vs OpenAI, the Leistungsschutzrecht (press publishers' right), and a more active publishers' association. Austrian and Swiss titles mostly still let everything in.

Average blocked AI bots per country (out of 19 AI bot categories analyzed):

  Germany: 8.3 (25 sites)
  Austria: 3.8 (18 sites)
  Switzerland: 3.3 (7 sites)

Tabloids are more open than quality press

Categories ranked by average AI bots blocked:

The quality press and public broadcasters block most aggressively; tabloids let AI bots through almost entirely. The effect is predictable: a sports clip from Krone is more likely to land in an AI answer than a Falter investigative feature, and therefore becomes more visible.

The unnoticed AI crawlers

The numbers above show: the big, well-known bots (GPTBot, Claude, CCBot) get blocked routinely. But:

China's DeepSeek reads 92% of DACH media unblocked. Mistral (the French open-source model) reads 76%. xAI (Grok from Twitter/X) gets in everywhere.

Nearly every blocklist is a copy-paste from 2023. GPTBot was added, Claude was added, done. The newer AI bots are missing from every robots.txt I've seen — except at the four ARD stations, which clearly maintain their list more actively.
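For comparison, a blocklist that also covers the post-2023 generation would look roughly like this. This is a sketch, not a recommendation; the exact user-agent strings should be verified against each vendor's current documentation before use:

```
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: PerplexityBot
User-agent: Meta-ExternalAgent
User-agent: Bytespider
User-agent: MistralAI-User
User-agent: DeepSeekBot
User-agent: xAI-Web-Crawler
Disallow: /
```

Grouping several User-agent lines over one Disallow rule is valid under RFC 9309 and keeps the file easy to maintain as new bots appear.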

The unnecessary contradiction pair

Some media publish legal signals and technical signals that contradict each other:

Krone.at writes in the comment at the top of its own robots.txt: "Use of any device, tool, or process designed to data mine or scrape the content using automated means is prohibited... (1) text and data mining activities under Art. 4 of the EU Directive on Copyright in the Digital Single Market; (2) text and data mining in the meaning of § 42h (6) of the Austrian Copyright Act; (3) the development of any software, machine learning, artificial intelligence (AI), and/or large language models (LLMs)."

That is a legally binding opt-out. Yet the same robots.txt technically allows every AI crawler, because the User-agent: * block only disallows paths like /navi-content, /forum/*, /sport-navigation, not the root or the actual articles.

Which means: Legal opt-out: yes. Technical opt-out: no. Any AI can crawl and train on Krone articles unblocked. The legal claims would have force under Art. 4 of the EU Copyright Directive (which requires machine-readable opt-outs), but most AI companies only look at the technical layer.
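The pitfall is easy to reproduce: path-level disallows under User-agent: * leave the articles themselves wide open. A minimal sketch with Krone-style rules (paths simplified, example.com URLs are placeholders; note that urllib's parser treats rules as plain prefixes, so /forum/ stands in for /forum/*):

```python
from urllib import robotparser

# Krone-style robots.txt: only navigation/forum paths are disallowed,
# and only for the generic * group -- no AI bot is named at all.
KRONE_STYLE = """\
User-agent: *
Disallow: /navi-content
Disallow: /forum/
Disallow: /sport-navigation
"""

rp = robotparser.RobotFileParser()
rp.parse(KRONE_STYLE.splitlines())

# The root and an article URL are fair game for any AI crawler;
# only the listed paths are refused.
print(rp.can_fetch("GPTBot", "https://example.com/"))                  # True
print(rp.can_fetch("ClaudeBot", "https://example.com/politik/story"))  # True
print(rp.can_fetch("GPTBot", "https://example.com/forum/thread-1"))    # False
```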

Capital.de goes the other direction and pairs the policy block with real enforcement: the server returns HTTP 402 (Payment Required!) for ClaudeBot and anthropic-ai. Not a standard robots.txt block — the server itself enforces the paywall at the HTTP layer. Rare.

Kurier.at flips the direction again: robots.txt allows Bingbot, YandexBot and Baiduspider, but the server delivers a 403 for exactly those three. The policy says "come in", the enforcement says "get out". For Yandex and Baidu this may be intentional (geopolitics); for Bing it is more likely a Cloudflare default nobody bothered to fix.

Both cases show: anyone who only looks at the robots.txt sees half the truth. The server response is the signal that actually counts.
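The enforcement layer can only be measured by actually fetching with the bot's user-agent string. A minimal stdlib sketch (the URL and UA string in the example call are placeholders; a real check should also respect rate limits):

```python
import urllib.error
import urllib.request

# How the status codes on the enforcement layer are read in this study.
STATUS_MEANING = {
    200: "open: content is delivered",
    402: "HTTP paywall: Payment Required (rare)",
    403: "hard block: WAF/CDN rule fires on the user-agent",
}

def enforcement_status(url: str, user_agent: str) -> int:
    """GET `url` while presenting `user_agent`; return the HTTP status code."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 402/403 surface as HTTPError in urllib

# e.g. enforcement_status("https://example.com/", "ClaudeBot/1.0")
```

Comparing this status against the robots.txt verdict for the same bot is exactly what exposes the Kurier.at-style contradictions: policy says allowed, server says 403.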

What this means for GEO (Generative Engine Optimization)

When a media outlet blocks ChatGPT, Claude and Common Crawl, the following happens:

  1. The content does not appear in ChatGPT search results
  2. It does not get cited in Claude
  3. It is missing from every LLM that trains on Common Crawl data (practically every open-source model plus many commercial ones)
  4. Perplexity answers use other sources

For Tagesschau or Spiegel, this might be strategically correct (content licensing deals with OpenAI instead of free access). For regional media or niche trade magazines without a deal, it means invisibility in the new discovery channel.

And half of the 50 largest DACH media sites have made this call — mostly without actively making it. The robots.txt got updated by an agency at some point in 2023 and has not been touched since.

Methodology (short)

The 36 tested bots in detail

🔍 Classic search engines (9): Googlebot, Bingbot, DuckDuckBot, YandexBot, Baiduspider, Applebot, PetalBot (Huawei), BraveBot, YouBot.

🤖 AI & LLM crawlers (19): GPTBot (OpenAI Training), OAI-SearchBot (OpenAI Search), ChatGPT-User (ChatGPT browser mode), ClaudeBot, Claude-SearchBot, Claude-User (all Anthropic), anthropic-ai (legacy UA), PerplexityBot, Perplexity-User, Google-Extended (Gemini training), Google-Agent (Project Mariner), DeepSeekBot, Meta-ExternalAgent, xAI-Web-Crawler (Grok), MistralAI-User, Applebot-Extended (Apple Intelligence), cohere-ai, CCBot (Common Crawl — foundation for many open-source LLMs), Bytespider (ByteDance/TikTok).

📱 Social & sharing (6): FacebookBot, facebookexternalhit, Pinterest, LinkedInBot, Twitterbot, Slackbot.

🔎 Other (2): Diffbot (web data extraction), Amazonbot.

The bot list covers the three strategically relevant groups: classic search (Google, Bing), AI training & AI search (OpenAI, Anthropic, Perplexity, Google, Apple, Meta, xAI, Mistral, DeepSeek), and social preview bots (for link previews in messaging apps). Who's in here defines where your content will be visible in 2026.

My personal takeaway

I've been doing SEO for 15 years and GEO for two, and I was surprised how little thought went into many of these robots.txt files. ARD followed through consistently. ZDF clearly made a different strategic call. But at many media houses, it feels like the AI blocklist was a quick copy-paste from 2023 that never got revisited.

The same applies in reverse: anyone setting AI blocks in 2026 should have DeepSeek, Mistral, xAI, Google-Agent, Meta-ExternalAgent and a dozen more on the list. Almost nobody does.

The next round of this race started a long time ago.

FAQ

Which AI crawlers are most often blocked?
GPTBot, ClaudeBot and anthropic-ai are blocked most frequently — each by 52% of the 50 DACH media sites analyzed. CCBot follows at 50%, cohere-ai at 48%, and Applebot-Extended and Bytespider at 46% each. Googlebot is not blocked by a single site. The asymmetry is systematic, not coincidental.
Why does almost nobody block Googlebot but half block GPTBot?
Blocking Googlebot means near-total invisibility in classical search — the risk to traffic and revenue is too high. Blocking GPTBot costs media sites almost nothing right now: ChatGPT sends very little referral traffic and publishers want to either license AI training or prevent it entirely. That decision was usually made two years ago and never reviewed since.
What role does robots.txt play in Generative Engine Optimization (GEO)?
robots.txt decides whether AI systems are even allowed to see your content. Block GPTBot, ClaudeBot and CCBot and you disappear from ChatGPT answers, Claude citations, and every LLM that trains on Common Crawl. For GEO, robots.txt is the first lever — before schema markup, llms.txt or content structure.
Which AI crawlers are often overlooked?
DeepSeekBot (only 8% of DACH media block it), MistralAI-User (24%), xAI-Web-Crawler (0%) and Google-Agent (0%). Most blocklists were set up in 2023 — these bots did not exist yet. China's DeepSeek model reads 92% of DACH media unblocked, and Mistral reads 76%.
How can I check AI crawler access on my own domain?
Use the Lumina Crawler Access Checker: enter a URL and the tool checks 36 bots in parallel, covering classical search engines, AI trainers, social media bots and agents. For each bot you see whether it is allowed, partially allowed or blocked, and you can inspect the underlying robots.txt rules directly.

Check your own AI crawler access

The same test I ran in this study — for your domain. Free, no signup, 36 bots in one step.

Open the Crawler Access Checker →