For 50 of the largest media sites in Germany, Austria and Switzerland I checked two things at once: what the robots.txt says (the official policy) and what the server actually delivers when an AI bot shows up with its real user-agent. Two layers, two different signals. Some outlets say the same thing at both layers. Others contradict themselves. All via one tool: Lumina's Crawler Access Checker, which tests 36 bots at once — 19 of them AI crawlers (ChatGPT, Claude, Perplexity, Gemini, Mistral, DeepSeek etc.), 9 classical search engines, and 8 others (social & misc).

The result is a big contradiction: Every site lets Google in. Half block AI.

Here's the data — plus a few findings that surprised me.

Two layers, two signals

robots.txt is the policy layer: a text file in which the outlet declares "bot X may crawl me, bot Y may not". Many AI companies respect it, but only voluntarily; a bot that ignores it is not technically stopped.

Server response is the enforcement layer: what the web server actually returns when a bot GETs with its real user-agent. 200 = content comes through, 403 = hard-blocked (Cloudflare, WAF or CDN rule), 402 = Payment Required (a real HTTP paywall against AI — rare).

Most outlets rely purely on robots.txt. A few (Capital.de, kurier.at) go further and enforce at the server. This study measures both.
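The policy layer can be checked with nothing but Python's standard library. A minimal sketch, using a made-up robots.txt in the style this study found at many German sites (the sample rules and the example.com URL are assumptions for illustration, not any outlet's actual file):

```python
from urllib import robotparser

# Hypothetical robots.txt: the big AI trainers are refused outright,
# everyone else (including Googlebot) only loses an internal path.
SAMPLE_ROBOTS = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /intern/
"""

rp = robotparser.RobotFileParser()
rp.parse(SAMPLE_ROBOTS.splitlines())

article = "https://example.com/politik/article-123"
for bot in ("Googlebot", "GPTBot", "ClaudeBot", "CCBot"):
    print(bot, rp.can_fetch(bot, article))
# Googlebot is allowed; the three AI trainers are not.
```

This only reads the policy. Whether the server actually enforces anything when the bot shows up is a separate measurement on the enforcement layer.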

The core numbers

Of 50 media outlets, how many block each bot via robots.txt (the policy layer — outlets that officially refuse the bot):

Google gets in everywhere, even at the most AI-skeptical sites. GPTBot is shut out by every second site before it even sees the content.

Blocking rate per bot (50 DACH media sites), as % of sites blocking the bot in robots.txt:

  GPTBot: 52%
  ClaudeBot: 52%
  anthropic-ai: 52%
  CCBot: 50%
  cohere-ai: 48%
  Applebot-Extended: 46%
  Bytespider: 46%
  Google-Extended: 42%
  Meta-ExternalAgent: 40%
  PerplexityBot: 36%
  ChatGPT-User: 34%
  MistralAI-User: 24%
  Googlebot: 0%

Who blocks the most

The top 5 AI-hostile media outlets (out of 19 AI bot categories):

  1. Tagesschau — 16/19 AI bots blocked
  2. NDR — 16/19
  3. WDR — 16/19
  4. BR — 16/19
  5. NZZ — 14/19

Four of the top five are German ARD public broadcasters. And they all use the exact same robots.txt — clearly an ARD-wide policy.

The ARD vs ZDF paradox

This is where it gets really interesting. Public broadcasters, same industry, same funding model. And yet:

ZDF lets Applebot-Extended, GPTBot, ClaudeBot, Perplexity, Google-Extended and everything else crawl freely. ARD shuts the same bots out completely. Both collect license fees from the same country. Over the next two years, this difference decides whose content gets cited in ChatGPT and Gemini — and whose does not.

Germany twice as aggressive as Austria and Switzerland

Average AI bots blocked per site:

German media are noticeably more defensive. This probably ties back to Axel Springer vs OpenAI, the Leistungsschutzrecht (press publishers' right), and a more active publishers' association. Austrian and Swiss titles mostly still let everything in.

Average blocked AI bots per country (out of 19 AI bot categories analyzed):

  Germany: 8.3 (25 sites)
  Austria: 3.8 (18 sites)
  Switzerland: 3.3 (7 sites)

Tabloids are more open than quality press

Categories ranked by average AI bots blocked:

The quality press and public broadcasters block most aggressively; tabloids let AI bots through almost entirely. The effect is predictable: a sports clip from Krone is more likely to land in an AI answer than a Falter investigative feature, and therefore becomes more visible.

The unnoticed AI crawlers

The numbers above show: the big, well-known bots (GPTBot, Claude, CCBot) get blocked routinely. But:

China's DeepSeek reads 92% of DACH media unblocked. Mistral (the French open-source model) reads 76%. xAI (Grok from Twitter/X) gets in everywhere.

Nearly every blocklist is a copy-paste from 2023. GPTBot was added, Claude was added, done. The newer AI bots are missing from every robots.txt I've seen — except at the four ARD stations, which clearly maintain their list more actively.
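For comparison, a blocklist that also covers the post-2023 generation would look roughly like this. This is a sketch, not a recommendation; the exact user-agent strings should be verified against each vendor's current documentation before use:

```
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: PerplexityBot
User-agent: Meta-ExternalAgent
User-agent: Bytespider
User-agent: MistralAI-User
User-agent: DeepSeekBot
User-agent: xAI-Web-Crawler
Disallow: /
```

Grouping several User-agent lines over one Disallow rule is valid under RFC 9309 and keeps the file easy to maintain as new bots appear.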

The unnecessary contradiction pair

Some media publish legal signals and technical signals that contradict each other:

Krone.at writes in the comment at the top of its own robots.txt: "Use of any device, tool, or process designed to data mine or scrape the content using automated means is prohibited... (1) text and data mining activities under Art. 4 of the EU Directive on Copyright in the Digital Single Market; (2) text and data mining in the meaning of § 42h (6) of the Austrian Copyright Act; (3) the development of any software, machine learning, artificial intelligence (AI), and/or large language models (LLMs)."

That is a legally binding opt-out. Yet the same robots.txt technically allows every AI crawler, because the User-agent: * block only disallows paths like /navi-content, /forum/*, /sport-navigation, not the root or the actual articles.

Which means: Legal opt-out: yes. Technical opt-out: no. Any AI can crawl and train on Krone articles unblocked. The legal claims would have force under Art. 4 of the EU Copyright Directive (which requires machine-readable opt-outs), but most AI companies only look at the technical layer.
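The pitfall is easy to reproduce: path-level disallows under User-agent: * leave the articles themselves wide open. A minimal sketch with Krone-style rules (paths simplified, example.com URLs are placeholders; note that urllib's parser treats rules as plain prefixes, so /forum/ stands in for /forum/*):

```python
from urllib import robotparser

# Krone-style robots.txt: only navigation/forum paths are disallowed,
# and only for the generic * group -- no AI bot is named at all.
KRONE_STYLE = """\
User-agent: *
Disallow: /navi-content
Disallow: /forum/
Disallow: /sport-navigation
"""

rp = robotparser.RobotFileParser()
rp.parse(KRONE_STYLE.splitlines())

# The root and an article URL are fair game for any AI crawler;
# only the listed paths are refused.
print(rp.can_fetch("GPTBot", "https://example.com/"))                  # True
print(rp.can_fetch("ClaudeBot", "https://example.com/politik/story"))  # True
print(rp.can_fetch("GPTBot", "https://example.com/forum/thread-1"))    # False
```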

Capital.de goes the other direction and pairs the policy block with real enforcement: the server returns HTTP 402 (Payment Required!) for ClaudeBot and anthropic-ai. Not a standard robots.txt block — the server itself enforces the paywall at the HTTP layer. Rare.

Kurier.at flips the direction again: robots.txt allows Bingbot, YandexBot and Baiduspider, but the server delivers a 403 for exactly those three. The policy says "come in", the enforcement says "get out". For Yandex and Baidu this may be intentional (geopolitics); for Bing it is more likely a Cloudflare default nobody bothered to fix.

Both cases show: anyone who only looks at the robots.txt sees half the truth. The server response is the signal that actually counts.
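The enforcement layer can only be measured by actually fetching with the bot's user-agent string. A minimal stdlib sketch (the URL and UA string in the example call are placeholders; a real check should also respect rate limits):

```python
import urllib.error
import urllib.request

# How the status codes on the enforcement layer are read in this study.
STATUS_MEANING = {
    200: "open: content is delivered",
    402: "HTTP paywall: Payment Required (rare)",
    403: "hard block: WAF/CDN rule fires on the user-agent",
}

def enforcement_status(url: str, user_agent: str) -> int:
    """GET `url` while presenting `user_agent`; return the HTTP status code."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 402/403 surface as HTTPError in urllib

# e.g. enforcement_status("https://example.com/", "ClaudeBot/1.0")
```

Comparing this status against the robots.txt verdict for the same bot is exactly what exposes the Kurier.at-style contradictions: policy says allowed, server says 403.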

What this means for GEO (Generative Engine Optimization)

When a media outlet blocks ChatGPT, Claude and Common Crawl, the following happens:

  1. The content does not appear in ChatGPT search results
  2. It does not get cited in Claude
  3. It is missing from every LLM that trains on Common Crawl data (practically every open-source model plus many commercial ones)
  4. Perplexity answers use other sources

For Tagesschau or Spiegel, this might be strategically correct (content licensing deals with OpenAI instead of free access). For regional media or niche trade magazines without a deal, it means invisibility in the new discovery channel.

And half of the 50 largest DACH media sites have made this call — mostly without actively making it. The robots.txt got updated by an agency at some point in 2023 and has not been touched since.

Methodology (short)

The 36 tested bots in detail

🔍 Classic search engines (9): Googlebot, Bingbot, DuckDuckBot, YandexBot, Baiduspider, Applebot, PetalBot (Huawei), BraveBot, YouBot.

🤖 AI & LLM crawlers (19): GPTBot (OpenAI Training), OAI-SearchBot (OpenAI Search), ChatGPT-User (ChatGPT browser mode), ClaudeBot, Claude-SearchBot, Claude-User (all Anthropic), anthropic-ai (legacy UA), PerplexityBot, Perplexity-User, Google-Extended (Gemini training), Google-Agent (Project Mariner), DeepSeekBot, Meta-ExternalAgent, xAI-Web-Crawler (Grok), MistralAI-User, Applebot-Extended (Apple Intelligence), cohere-ai, CCBot (Common Crawl — foundation for many open-source LLMs), Bytespider (ByteDance/TikTok).

📱 Social & sharing (6): FacebookBot, facebookexternalhit, Pinterest, LinkedInBot, Twitterbot, Slackbot.

🔎 Other (2): Diffbot (web data extraction), Amazonbot.

The bot list covers the three strategically relevant groups: classic search (Google, Bing), AI training & AI search (OpenAI, Anthropic, Perplexity, Google, Apple, Meta, xAI, Mistral, DeepSeek), and social preview bots (for link previews in messaging apps). Who's in here defines where your content will be visible in 2026.

My personal takeaway

I've been doing SEO for 15 years and GEO for two, and I was surprised how little thought went into many of these robots.txt files. ARD followed through consistently. ZDF clearly made a different strategic call. But at many media houses, it feels like the AI blocklist was a quick copy-paste from 2023 that never got revisited.

The same applies in reverse: anyone setting AI blocks in 2026 should have DeepSeek, Mistral, xAI, Google-Agent, Meta-ExternalAgent and a dozen more on the list. Almost nobody does.

The next round of this race started a long time ago.

FAQ

Which AI crawlers are most often blocked?
GPTBot, ClaudeBot and anthropic-ai are blocked most frequently — each by 52% of the 50 DACH media sites analyzed. CCBot follows at 50%, cohere-ai at 48%, and Applebot-Extended and Bytespider at 46% each. Googlebot is not blocked by a single site. The asymmetry is systematic, not coincidental.
Why does almost nobody block Googlebot but half block GPTBot?
Blocking Googlebot means near-total invisibility in classical search — the risk to traffic and revenue is too high. Blocking GPTBot costs media sites almost nothing right now: ChatGPT sends very little referral traffic and publishers want to either license AI training or prevent it entirely. That decision was usually made two years ago and never reviewed since.
What role does robots.txt play in Generative Engine Optimization (GEO)?
robots.txt decides whether AI systems are even allowed to see your content. Block GPTBot, ClaudeBot and CCBot and you disappear from ChatGPT answers, Claude citations, and every LLM that trains on Common Crawl. For GEO, robots.txt is the first lever — before schema markup, llms.txt or content structure.
Which AI crawlers are often overlooked?
DeepSeekBot (only 8% of DACH media block it), MistralAI-User (24%), xAI-Web-Crawler (0%) and Google-Agent (0%). Most blocklists were set up in 2023 — these bots did not exist yet. China's DeepSeek model reads 92% of DACH media unblocked, and Mistral reads 76%.
How can I check AI crawler access on my own domain?
Use the Lumina Crawler Access Checker: enter a URL and the tool checks 36 bots in parallel, covering classical search engines, AI trainers, social media bots and agents. For each bot you see whether it is allowed, partially allowed or blocked, and you can inspect the underlying robots.txt rules directly.

Check your own AI crawler access

The same test I ran in this study — for your domain. Free, no signup, 36 bots in one step.

Open the Crawler Access Checker →