2026-03-31 · robots.txt · SEO · AI · Web

AI Crawler Opt-Out: What robots.txt Can and Can't Do

Every webmaster is now asking whether to block AI crawlers. The answer isn't obvious — and robots.txt is a weaker tool than most people think. Here's what actually happens when you add GPTBot to your disallow list.

Since OpenAI published GPTBot's user agent string in 2023, the robots.txt conversation has changed permanently. Now there are a dozen AI crawlers with named user agents, and every site owner has to decide: allow them, block them, or do nothing.

Before you add a line to your robots.txt, it's worth understanding what that line actually does — and what it doesn't.

The crawlers you need to know about

The major AI crawlers with named user agents that respond to robots.txt:

- GPTBot (OpenAI): collects training data for GPT models
- ClaudeBot (Anthropic): Anthropic's web crawler
- Google-Extended (Google): the product token controlling Gemini training and grounding
- PerplexityBot (Perplexity): retrieval crawler used to fetch and cite pages in answers
- Bytespider (ByteDance): ByteDance's web crawler
- CCBot (Common Crawl): builds the open Common Crawl corpus, a common source of AI training data

What robots.txt actually does

robots.txt is a voluntary protocol. There is no technical enforcement. A well-behaved crawler reads your robots.txt before crawling and respects the Disallow directives. A poorly-behaved or malicious crawler ignores it entirely.

The major named AI crawlers — GPTBot, ClaudeBot, Google-Extended — are from large companies with reputational stakes and legal teams. They generally respect robots.txt. The long tail of smaller crawlers and scrapers? Much less consistent.

Key point: blocking a crawler in robots.txt prevents future crawling from that crawler. It does not remove your content from training datasets that were built before you added the block. If GPTBot already crawled your site last year, that data is already in the training set.
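The protocol's voluntary nature is visible on the client side: a compliant crawler fetches robots.txt, parses it, and checks each URL before requesting it, and nothing enforces that check. Python's standard-library urllib.robotparser performs the same check, which makes it easy to verify what a given directive actually permits. A minimal sketch:

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that blocks GPTBot but allows everyone else.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A compliant crawler runs this check before every fetch;
# a non-compliant one simply never asks.
print(rp.can_fetch("GPTBot", "https://example.com/article"))        # False
print(rp.can_fetch("SomeOtherBot", "https://example.com/article"))  # True
```

Note that can_fetch is purely advisory: it tells a crawler what it should do, and a scraper that never calls it sees no difference.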

Should you block AI crawlers?

This is a genuine strategic decision, not a clear-cut technical one. The right answer depends on what you want:

Block if: your content is proprietary, paywalled, or you're concerned about your work being used to train commercial AI models without compensation. Publishers, news organizations, and content creators with monetized archives are the clearest cases for blocking.

Allow if: you want your content cited in AI-generated answers. When someone asks ChatGPT or Perplexity a question and your content is in the training data or can be crawled, there's a chance your site gets cited. This is called GEO — Generative Engine Optimisation — and it's the emerging counterpart to traditional SEO. Sites that block all AI crawlers are excluded from this entirely.

Selective approach: block training crawlers (GPTBot, CCBot) while allowing retrieval crawlers (PerplexityBot), and decide separately on Google-Extended. This attempts to avoid bulk training-data use while preserving citation opportunities.
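One way to express that selective policy in robots.txt (treat the bot list as illustrative, not exhaustive):

# Selective policy: block training crawlers, allow retrieval crawlers
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: PerplexityBot
Allow: /

# Allow or block depending on your stance on Gemini training
User-agent: Google-Extended
Allow: /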

How to block specific AI crawlers

# Block OpenAI training crawler:
User-agent: GPTBot
Disallow: /

# Block Anthropic crawler:
User-agent: ClaudeBot
Disallow: /

# Block Google's AI training (doesn't affect search rankings):
User-agent: Google-Extended
Disallow: /

# Block ByteDance:
User-agent: Bytespider
Disallow: /

# Block Common Crawl:
User-agent: CCBot
Disallow: /

How to allow all AI crawlers explicitly

# Explicit allow for AI crawlers (GEO strategy):
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: PerplexityBot
Allow: /

You don't technically need to add Allow directives if you're not blocking anything — crawlers are allowed by default. But explicit Allow entries signal intent and may be read by AI systems evaluating whether your content is available for citation.

The Google-Extended nuance

Google-Extended is worth understanding separately, because its name suggests more control than it provides. It isn't a crawler at all but a product token: blocking it tells Google not to use your content for training and grounding its Gemini models. It does not control AI Overviews (the AI-generated summaries at the top of search results), which are built from the regular search index that Googlebot collects.

In practice, that means blocking Google-Extended keeps you out of Gemini training without affecting your search presence, but no robots.txt directive removes you from AI Overviews on its own. The levers there are search-level controls such as nosnippet, which also affect your ordinary search snippets; with robots.txt alone, AI Overviews and classic search are all or nothing.

Check your current AI bot coverage

Most robots.txt files were written before AI crawlers existed and don't address them at all. The first step is knowing where you currently stand.
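Before reaching for a tool, you can approximate this audit with a short script. A sketch in Python; the bot list and the three status labels are conventions of this sketch, not a standard:

```python
from urllib.robotparser import RobotFileParser

AI_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended",
           "Bytespider", "CCBot", "PerplexityBot"]

def ai_bot_coverage(robots_txt: str, bots=AI_BOTS) -> dict:
    """Label each bot 'allowed', 'blocked', or 'unaddressed'."""
    # Which user agents does the file mention explicitly?
    named = set()
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()
        if line.lower().startswith("user-agent:"):
            named.add(line.split(":", 1)[1].strip().lower())

    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    coverage = {}
    for bot in bots:
        if bot.lower() not in named:
            # Not mentioned at all: crawling is allowed by default.
            coverage[bot] = "unaddressed"
        elif parser.can_fetch(bot, "/"):
            coverage[bot] = "allowed"
        else:
            coverage[bot] = "blocked"
    return coverage

example = """\
User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Allow: /
"""
print(ai_bot_coverage(example))
# GPTBot is 'blocked', PerplexityBot 'allowed', the rest 'unaddressed'
```

One limitation: this marks a bot "unaddressed" even when a wildcard group (User-agent: *) would block it, which matches the narrower question this section asks, namely whether your file names these bots at all.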

Paste your robots.txt to see your AI bot coverage score — which crawlers are explicitly allowed, which are blocked, and which are unaddressed.

Open robots.txt Validator →

The honest answer

robots.txt is a reasonable first line of defence against well-behaved AI crawlers. It's not a legal instrument, it's not technically enforced, and it doesn't reach data that's already been collected. But for the major named crawlers from large companies, it's generally respected and worth using if you have a clear preference.

The most important thing is to make a deliberate decision rather than ignoring the question. An unaddressed robots.txt in 2026 is an implicit "I haven't thought about this" — which is a fine answer but probably not the one you intend.
