For 50 of the largest media sites in Germany, Austria and Switzerland I checked two things at once: what the robots.txt says (the official policy) and what the server actually delivers when an AI bot shows up with its real user-agent. Two layers, two different signals. Some outlets say the same thing at both layers. Others contradict themselves. All via one tool: Lumina's Crawler Access Checker, which tests 36 bots at once — 19 of them AI crawlers (ChatGPT, Claude, Perplexity, Gemini, Mistral, DeepSeek etc.), 9 classical search engines, and 8 others (social & misc).
The result is one big contradiction: every site lets Google in, while half block AI.
Here's the data — plus a few findings that surprised me.
Two layers, two signals
robots.txt is the policy layer: a text file in which the outlet declares "bot X may crawl me, bot Y may not". Many AI companies respect it, but only voluntarily; a crawler that ignores it faces no technical barrier.
Server response is the enforcement layer: what the web server actually returns when a bot GETs with its real user-agent. 200 = content comes through, 403 = hard-blocked (Cloudflare, WAF or CDN rule), 402 = Payment Required (a real HTTP paywall against AI — rare).
Most outlets rely purely on robots.txt. A few (Capital.de, kurier.at) go further and enforce at the server. This study measures both.
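The two-layer check can be sketched with nothing but the Python standard library. This is a minimal illustration, not the study's actual tooling (which ran through the Lumina checker via a Cloudflare Worker proxy), and the example user-agent string in the comment is an assumption:

```python
# Minimal sketch of the two-layer check: policy (robots.txt) vs.
# enforcement (what the server actually returns to the bot's UA).
import urllib.error
import urllib.request
import urllib.robotparser

# What the enforcement-layer status codes mean (per the text above).
STATUS_MEANING = {
    200: "content comes through",
    403: "hard-blocked (Cloudflare, WAF or CDN rule)",
    402: "HTTP paywall (Payment Required)",
}

def check_bot(domain: str, bot_token: str, bot_ua: str):
    """Return (policy_allows, http_status) for one bot on one domain."""
    # Layer 1: policy. Note that the stdlib parser predates RFC 9309
    # and differs from it in edge cases (e.g. no path wildcards).
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"https://{domain}/robots.txt")
    rp.read()
    policy_allows = rp.can_fetch(bot_token, f"https://{domain}/")

    # Layer 2: enforcement. A real GET with the bot's real user-agent.
    req = urllib.request.Request(f"https://{domain}/",
                                 headers={"User-Agent": bot_ua})
    try:
        status = urllib.request.urlopen(req, timeout=10).status
    except urllib.error.HTTPError as e:
        status = e.code  # 403 / 402 blocks arrive as HTTPError
    return policy_allows, status

# Example call (user-agent string is illustrative, not verbatim):
# check_bot("example.com", "GPTBot", "Mozilla/5.0 (compatible; GPTBot/1.2)")
```

A site that only relies on robots.txt will show `policy_allows=False` but still return a 200 in layer 2 to any crawler that ignores the policy.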
The core numbers
Of 50 media outlets, how many block each bot via robots.txt (the policy layer — outlets that officially refuse the bot):
- GPTBot: 26/50 (52%) — OpenAI's training crawler
- ClaudeBot: 26/50 (52%) — Anthropic's crawler
- anthropic-ai: 26/50 (52%) — Anthropic's legacy name
- CCBot: 25/50 (50%) — Common Crawl, basis for many LLM training sets
- cohere-ai: 24/50 (48%)
- Applebot-Extended: 23/50 (46%) — Apple Intelligence training
- Bytespider: 23/50 (46%) — ByteDance/TikTok
- Google-Extended: 21/50 (42%) — Gemini training opt-out
- Meta-ExternalAgent: 20/50 (40%)
- PerplexityBot: 18/50 (36%)
- Googlebot: 0/50 (0%) — Not a single site blocks Google fully
Google gets in everywhere, even at the most AI-skeptical sites. GPTBot is shut out of every second one before it even sees the content.
[Chart: Blocking rate per bot (50 DACH media sites)]
Who blocks the most
The five most AI-hostile outlets, ranked by how many of the 19 tested AI bots they block:
- Tagesschau — 16/19 AI bots blocked
- NDR — 16/19
- WDR — 16/19
- BR — 16/19
- NZZ — 14/19
Four of the top five are German ARD public broadcasters. And they all use the exact same robots.txt — clearly an ARD-wide policy.
The ARD vs ZDF paradox
This is where it gets really interesting. Public broadcasters, same industry, same funding model. And yet:
- ARD (Tagesschau, NDR, WDR, BR): 16/19 AI bots blocked
- ZDF: 0/19 blocked
- SRF (Switzerland): 0/19 blocked
- ORF (Austria): 8/19
ZDF lets Applebot-Extended, GPTBot, ClaudeBot, Perplexity, Google-Extended and everything else crawl freely. ARD shuts the same bots out completely. Both collect license fees from the same country. Over the next two years, this difference decides whose content gets cited in ChatGPT and Gemini — and whose does not.
Germany twice as aggressive as Austria and Switzerland
Average AI bots blocked per site:
- DE: 8.3 out of 19
- AT: 3.8 out of 19
- CH: 3.3 out of 19
German media are noticeably more defensive. This probably ties back to Axel Springer vs OpenAI, the Leistungsschutzrecht (press publishers' right), and a more active publishers' association. Austrian and Swiss titles mostly still let everything in.
[Chart: Average blocked AI bots per country]
Tabloids are more open than quality press
Categories ranked by average AI bots blocked:
- Weekly magazines (Zeit, Falter): 10.5 blocked
- Public broadcasters: 10.3
- Tech (Heise, Golem, t3n): 8.3
- Business (Handelsblatt, Capital, WiWo, Manager Magazin, Trend): 7.8
- Daily newspapers: 6.5
- Regional: 6.2
- Magazines: 2.2
- Tabloids (Bild, Krone, Heute, oe24, Blick): 2.2
- TV (ServusTV, Puls24, n-tv): 0.0
The quality press and public broadcasters block most aggressively, while tabloids let AI bots through almost entirely. The consequence: a sports clip from Krone is more likely to land in an AI answer than a Falter investigative feature, and therefore becomes more visible in that channel.
The unnoticed AI crawlers
The numbers above show: the big, well-known bots (GPTBot, Claude, CCBot) get blocked routinely. But:
- DeepSeekBot: only 4/50 (8%) blocked
- MistralAI-User: only 12/50 (24%)
- xAI-Web-Crawler: 0/50
- Google-Agent: 0/50
China's DeepSeek can read 92% of DACH media unblocked. Mistral (the French model company) can read 76%. xAI (Grok, from Twitter/X) gets in everywhere.
Nearly every blocklist is a copy-paste from 2023. GPTBot was added, Claude was added, done. The newer AI bots are missing from every robots.txt I've seen — except at the four ARD stations, which clearly maintain their list more actively.
Where policy and enforcement contradict each other
Some media publish legal signals and technical signals that contradict each other:
Krone.at writes in the comment at the top of its own robots.txt: "Use of any device, tool, or process designed to data mine or scrape the content using automated means is prohibited... (1) text and data mining activities under Art. 4 of the EU Directive on Copyright in the Digital Single Market; (2) text and data mining in the meaning of § 42h (6) of the Austrian Copyright Act; (3) the development of any software, machine learning, artificial intelligence (AI), and/or large language models (LLMs)."
That is a legally binding opt-out. Yet in the same robots.txt, every AI crawler gets a technical allow signal, because the User-Agent: * block only disallows paths like /navi-content, /forum/*, /sport-navigation, not the root or the actual articles.
Which means: Legal opt-out: yes. Technical opt-out: no. Any AI can crawl and train on Krone articles unblocked. The legal claims would have force under Art. 4 of the EU Copyright Directive (which requires machine-readable opt-outs), but most AI companies only look at the technical layer.
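The allow-by-omission effect is easy to reproduce with Python's standard-library parser. The paths below are modeled on the ones quoted above, and the article URL is invented for illustration:

```python
# A Krone-style robots.txt: the wildcard block disallows only a few
# navigation paths, so actual articles stay crawlable for every bot.
import urllib.robotparser

robots_txt = """\
User-agent: *
Disallow: /navi-content
Disallow: /forum/
Disallow: /sport-navigation
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# No AI bot is named, so all of them inherit the wildcard rules,
# and those rules never touch the root or article URLs:
print(rp.can_fetch("GPTBot", "/article/politik/12345"))  # True
print(rp.can_fetch("ClaudeBot", "/"))                    # True
print(rp.can_fetch("GPTBot", "/forum/thread-1"))         # False
```

Only paths that literally start with a disallowed prefix are off-limits; everything else, including the homepage and every article, is allowed for every user-agent.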
Capital.de goes the other direction and pairs the policy block with real enforcement: the server returns HTTP 402 (Payment Required!) for ClaudeBot and anthropic-ai. Not a standard robots.txt block — the server itself enforces the paywall at the HTTP layer. Rare.
Kurier.at flips the direction again: robots.txt allows Bingbot, YandexBot and Baiduspider. The server delivers a 403 for exactly those three. The policy says „come in", the enforcement says „get out" — for Yandex and Baidu maybe intentional (geopolitics), for Bing more likely a Cloudflare default nobody bothered to fix.
Both cases show: anyone who only looks at the robots.txt sees half the truth. The server response is the signal that actually counts.
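The three cases above can be folded into one verdict function. The labels and the set of statuses that count as enforced blocks are my own framing, not the study's classification:

```python
def two_layer_verdict(policy_allows: bool, status: int) -> str:
    """Combine the policy layer and the enforcement layer."""
    # Assumption: these statuses count as server-enforced blocks.
    blocked_statuses = (401, 402, 403, 451)
    if policy_allows and status == 200:
        return "open at both layers"
    if not policy_allows and status == 200:
        return "policy block only"            # relies on voluntary compliance
    if policy_allows and status in blocked_statuses:
        return "policy allows, server blocks"  # the Kurier.at pattern
    if not policy_allows and status in blocked_statuses:
        return "blocked at both layers"        # the Capital.de pattern
    return f"unclear: policy_allows={policy_allows}, status={status}"

print(two_layer_verdict(True, 403))   # policy allows, server blocks
print(two_layer_verdict(False, 402))  # blocked at both layers
```

Note that most sites in the sample land in "policy block only": the robots.txt says no, but nothing stops a crawler that ignores it.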
What this means for GEO (Generative Engine Optimization)
When a media outlet blocks ChatGPT, Claude and Common Crawl, the following happens:
- The content does not appear in ChatGPT search results
- It does not get cited in Claude
- It is missing from every LLM that trains on Common Crawl data (practically every open-source model plus many commercial ones)
- Perplexity answers use other sources
For Tagesschau or Spiegel, this might be strategically correct (content licensing deals with OpenAI instead of free access). For regional media or niche trade magazines without a deal, it means invisibility in the new discovery channel.
And half of the 50 largest DACH media sites have made this call — mostly without actively making it. The robots.txt got updated by an agency at some point in 2023 and has not been touched since.
Methodology (short)
- Sample: 50 highest-reach DACH media (18 AT, 25 DE, 7 CH), as of 2026-04-15
- Tool: Lumina Crawler Access Checker (live robots.txt analysis via Cloudflare Worker proxy)
- Measurement: two-stage. (1) robots.txt analysis with an RFC-9309-compliant parser (policy layer). (2) Live server check per bot: a real GET to the origin with each bot's actual user-agent, capturing HTTP status + response time per bot (enforcement layer). 36 bot user-agents checked — 19 of them AI crawlers (ChatGPT, Claude, Perplexity, Gemini, Mistral, DeepSeek, xAI, Apple Intelligence, Meta, Cohere, CCBot), plus 9 classic search engines (Google, Bing, Yandex …), 6 social-media bots (Facebook, LinkedIn, Twitter …), 2 other (Diffbot, Amazonbot).
- Classification: 4 status levels (allowed, rules, partial, blocked) per Lumina tool logic
- Reproducible: readers can rerun every analysis in the tool itself; enter a URL and you get the identical data
- Raw data: all 50 robots.txt files plus analysis as JSON on GitHub
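The exact Lumina classification logic is not reproduced in this article; as a rough sketch of how findings for one bot could map onto the four status levels (the level semantics here are my guess, not the tool's spec):

```python
def classify(root_allowed: bool, bot_has_rules: bool,
             some_paths_blocked: bool) -> str:
    """Map one bot's robots.txt findings onto four status levels.

    'blocked' - the bot may not fetch the root / the articles
    'partial' - root allowed, but some sections are disallowed
    'rules'   - bot-specific rules exist, nothing relevant blocked
    'allowed' - no restrictions at all
    """
    if not root_allowed:
        return "blocked"
    if some_paths_blocked:
        return "partial"
    if bot_has_rules:
        return "rules"
    return "allowed"
```

Under this reading, the Krone case from above would classify as "partial" for every AI bot, despite the legal opt-out in the comment.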
The 36 tested bots in detail
🔍 Classic search engines (9): Googlebot, Bingbot, DuckDuckBot, YandexBot, Baiduspider, Applebot, PetalBot (Huawei), BraveBot, YouBot.
🤖 AI & LLM crawlers (19): GPTBot (OpenAI Training), OAI-SearchBot (OpenAI Search), ChatGPT-User (ChatGPT browser mode), ClaudeBot, Claude-SearchBot, Claude-User (all Anthropic), anthropic-ai (legacy UA), PerplexityBot, Perplexity-User, Google-Extended (Gemini training), Google-Agent (Project Mariner), DeepSeekBot, Meta-ExternalAgent, xAI-Web-Crawler (Grok), MistralAI-User, Applebot-Extended (Apple Intelligence), cohere-ai, CCBot (Common Crawl — foundation for many open-source LLMs), Bytespider (ByteDance/TikTok).
📱 Social & sharing (6): FacebookBot, facebookexternalhit, Pinterest, LinkedInBot, Twitterbot, Slackbot.
🔎 Other (2): Diffbot (web data extraction), Amazonbot.
The bot list covers the three strategically relevant groups: classic search (Google, Bing), AI training & AI search (OpenAI, Anthropic, Perplexity, Google, Apple, Meta, xAI, Mistral, DeepSeek), and social preview bots (for link previews in messaging apps). Who's in here defines where your content will be visible in 2026.
My personal takeaway
I've been doing SEO for 15 years and GEO for two, and I was surprised how little thought went into many of these robots.txt files. ARD followed through consistently. ZDF clearly made a different strategic call. But at many media houses, it feels like the AI blocklist was a quick copy-paste from 2023 that never got revisited.
The same applies in reverse: anyone setting AI blocks in 2026 should have DeepSeekBot, MistralAI-User, xAI-Web-Crawler, Google-Agent, Meta-ExternalAgent and a dozen more on the list. Almost nobody does.
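For illustration only (verify every token against the vendor's own documentation before deploying), a 2026-era blocklist can group the AI tokens from the bot table above into a single rule block, which the robots.txt spec permits:

```
# AI training & AI search crawlers, 2026 edition
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: CCBot
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: Meta-ExternalAgent
User-agent: Bytespider
User-agent: cohere-ai
User-agent: PerplexityBot
User-agent: DeepSeekBot
User-agent: MistralAI-User
User-agent: xAI-Web-Crawler
User-agent: Google-Agent
Disallow: /
```

And as the Kurier and Capital examples show, a robots.txt entry is only the policy half; real enforcement needs a WAF/CDN rule on top.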
The next round of this race started a long time ago.
FAQ
Check your own AI crawler access
The same test I ran in this study — for your domain. Free, no signup, 36 bots in one step.
Open the Crawler Access Checker →