If you blocked AI bots in your robots.txt sometime in the last two years, there is a real chance you have quietly removed yourself from ChatGPT, Perplexity, and Google's AI answers. The advice to "block the AI crawlers" made sense when the only thing those bots did was scrape content for training. That is no longer how it works, and the old blanket-block approach is now one of the most common reasons a site never shows up in AI answers.
Here is what changed, and how to check whether your own robots.txt is working against you.
The one distinction that changes everything
In 2026, the major AI vendors run separate crawlers for separate jobs. A single Disallow rule that once blocked "the AI bot" now only blocks one specific function. Broadly, AI crawlers fall into three buckets:
- Training crawlers — fetch content to train models. Blocking them is a training opt-out. They do not send you traffic and blocking them does not remove you from AI answers.
- Search / retrieval crawlers — build the index that answer engines cite from, and fetch live pages to answer queries. Blocking these removes you from that engine's answers entirely.
- User-triggered fetchers — grab a page in real time when a user pastes a link or asks the assistant to look something up.
The critical insight: allowing a training bot is not the same as allowing a search bot, and blocking one does not block the other. OpenAI, Anthropic, Google, Amazon, and Apple have all split these functions apart. The old "block all AI bots" one-liner now actively harms visibility because it sweeps up the search crawlers that get you cited.
The bots that matter, and what each one does
Here are the user-agents worth knowing as of mid-2026, grouped by what blocking them actually does.
| Bot | Vendor | Type | What happens if you block it |
|---|---|---|---|
| GPTBot | OpenAI | Training | You opt out of training. ChatGPT can still cite you. |
| OAI-SearchBot | OpenAI | Search | You disappear from ChatGPT Search citations. |
| ChatGPT-User | OpenAI | User fetch | ChatGPT can't fetch a page a user asks it to open. |
| ClaudeBot | Anthropic | Training | Training opt-out only. |
| Claude-SearchBot | Anthropic | Search | You disappear from Claude's search citations. |
| Claude-User | Anthropic | User fetch | Claude can't fetch user-requested pages. |
| Google-Extended | Google | Generative gate | You opt out of Gemini training and grounding. Does not affect blue-link ranking or inclusion in Google Search. |
| Googlebot | Google | Search index | You lose Google Search entirely (don't block this). |
| PerplexityBot | Perplexity | Search | You disappear from Perplexity citations. |
| Perplexity-User | Perplexity | User fetch | Perplexity can't fetch user-requested pages. |
| Bingbot | Microsoft | Search index | You lose Bing, which some AI web search leans on. |
| CCBot | Common Crawl | Training | Feeds many open-source training sets. |
The pattern is clearest with OpenAI: GPTBot is for training, and OAI-SearchBot is what powers ChatGPT Search citations. Same company, separate rules. If you blocked GPTBot to keep your content out of training (a perfectly reasonable choice), you did not remove yourself from ChatGPT answers. But if you blocked OAI-SearchBot, or used a wildcard that caught it, you did.
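To make the wildcard case concrete, here is a minimal sketch that uses Python's standard `urllib.robotparser` as a stand-in for a crawler's matcher. Both robots.txt bodies and the example.com URL are hypothetical, and individual vendors' parsers may differ in edge cases, so treat the result as an approximation rather than a guarantee of how any specific bot behaves.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical file: no group names OAI-SearchBot, so it falls back to "*".
WILDCARD_BLOCK = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

# Same file plus a dedicated group; a named group takes precedence over "*".
WITH_SEARCH_GROUP = WILDCARD_BLOCK + """
User-agent: OAI-SearchBot
Allow: /
"""

URL = "https://example.com/pricing"  # hypothetical page

caught = not is_allowed(WILDCARD_BLOCK, "OAI-SearchBot", URL)
freed = is_allowed(WITH_SEARCH_GROUP, "OAI-SearchBot", URL)
print(f"caught by wildcard: {caught}, freed by named group: {freed}")
```

The point of the second file is that a crawler with its own group ignores the wildcard group, so a dedicated `Allow` group is enough to rescue a search bot without loosening the rule for everything else.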
Anthropic goes even further, running three distinct, independently controllable bots: ClaudeBot (training), Claude-SearchBot (search index), and Claude-User (real-time user fetches).
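Google is its own special case. Google-Extended is a robots.txt token, not an HTTP user-agent: Google's existing crawlers do the fetching, and the token only controls whether that content may be used to train Gemini models and to ground Gemini answers. Blocking it does not affect your normal Google Search ranking at all; that is governed by Googlebot, a completely separate control. According to Google's documentation, AI Overviews are a Search feature, so they follow Googlebot and the standard snippet controls rather than Google-Extended. People panic that Google-Extended will tank their rankings. It won't, because it isn't a ranking control.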
What the "block everything" mistake looks like
A robots.txt written in 2024 to keep content out of AI might look like this:
```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: *
Disallow: /ai/
```
At the time, that was a defensible training opt-out. In 2026 it has a side effect the author never intended: PerplexityBot is a search crawler, so this file has silently removed the site from every Perplexity answer. OAI-SearchBot is not named, so it falls under the wildcard group and is blocked only from /ai/. A site that wants ChatGPT visibility should confirm that nothing else catches it, or allow it explicitly.
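The effect of the 2024-style file can be checked mechanically. The sketch below copies that file into a string, parses it with Python's standard `urllib.robotparser`, and reports which known bots are blocked and which bucket each belongs to. The `BUCKETS` mapping is taken from the table earlier in this article, the example.com URL is hypothetical, and the parser is only an approximation of each vendor's own matcher.

```python
from urllib.robotparser import RobotFileParser

# Buckets taken from the table above; extend the mapping for other bots.
BUCKETS = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "CCBot": "training",
    "OAI-SearchBot": "search",
    "Claude-SearchBot": "search",
    "PerplexityBot": "search",
    "ChatGPT-User": "user fetch",
    "Claude-User": "user fetch",
    "Perplexity-User": "user fetch",
}

LEGACY_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: *
Disallow: /ai/
"""


def blocked_bots(robots_txt: str, url: str) -> dict[str, str]:
    """Return {bot: bucket} for every known bot that may not fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        bot: bucket
        for bot, bucket in BUCKETS.items()
        if not parser.can_fetch(bot, url)
    }


report = blocked_bots(LEGACY_ROBOTS_TXT, "https://example.com/pricing")
for bot, bucket in report.items():
    # Only non-training buckets cost the site citations.
    flag = "training opt-out" if bucket == "training" else "VISIBILITY LOSS"
    print(f"{bot:<18} {bucket:<11} {flag}")
```

Any blocked bot outside the training bucket is the kind of entry worth a second look, since that is where an opt-out turns into lost citations.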
The 2026 setup: block training, keep citations
The defensible default for most brands is to allow the search and retrieval crawlers while making a deliberate, separate decision about the training crawlers. If you want to stay out of training but remain citable, that is completely valid — the citation path stays open:
```
# Search / answer-engine crawlers: allow to stay citable
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-User
Allow: /

# Google-Extended: allow if Gemini may train on and ground answers in your content
User-agent: Google-Extended
Allow: /

# Training crawlers: opt out here if you want to (optional, does not affect citations)
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /
```
Adjust to your own policy. If you are comfortable being used for training, you can allow the training bots too — many sites do, on the theory that being in the training corpus can't hurt. But the non-negotiable rule if visibility is your goal: allow the search crawlers. There is no setting that lets you be cited without being crawled by the search bot.
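If the training decision changes over time, it can help to generate the file from the policy rather than hand-edit it. The following is a minimal sketch, not a complete tool: `build_robots_txt` and its `block_training` flag are hypothetical names, the bot lists are an illustrative subset drawn from this article (Google-Extended is left out and can be added the same way), and the round-trip check uses Python's standard `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

SEARCH_AND_FETCH_BOTS = [
    "OAI-SearchBot", "Claude-SearchBot", "PerplexityBot",
    "ChatGPT-User", "Claude-User", "Perplexity-User",
]
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "CCBot"]


def build_robots_txt(block_training: bool) -> str:
    """Render groups: search bots always allowed, training bots per policy."""
    lines = ["# Search / answer-engine crawlers: allow to stay citable"]
    for bot in SEARCH_AND_FETCH_BOTS:
        lines += [f"User-agent: {bot}", "Allow: /", ""]
    lines.append("# Training crawlers: policy choice, does not affect citations")
    rule = "Disallow: /" if block_training else "Allow: /"
    for bot in TRAINING_BOTS:
        lines += [f"User-agent: {bot}", rule, ""]
    return "\n".join(lines)


def can_fetch(robots_txt: str, bot: str, url: str = "https://example.com/") -> bool:
    """Round-trip check: parse the generated text and query it for one bot."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(bot, url)


opt_out = build_robots_txt(block_training=True)
print(opt_out)
```

Keeping the search list unconditional in the generator encodes the rule from the text: the training decision is a switch, while search access is not.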
The trap almost everyone misses: your CDN
Here is the part that catches out even careful teams. A perfectly configured robots.txt is irrelevant if your CDN or firewall is blocking AI bots at the edge.
Cloudflare and other providers now ship AI-bot-blocking features that return a 403 Forbidden before the request ever reaches your server — which means your robots.txt never even gets read. Some estimates suggest a meaningful share of B2B sites accidentally block high-value AI crawlers this way without realizing it.
So auditing robots.txt is only half the job. You also need to check:
- Your CDN dashboard (Cloudflare's "Block AI bots" / bot-management settings, Fastly, Akamai, etc.).
- Any WAF rules targeting bot user-agents.
- Rate-limiting rules aggressive enough to look like a block to a crawler.
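One rough way to probe for edge blocking is to request the same page with a browser-like User-Agent and with bot-like ones, then compare status codes. The sketch below uses only Python's standard library. The User-Agent strings are simplified assumptions built from the bot tokens, not the vendors' real strings, and many CDNs also verify crawler IP ranges, so a 403 here can mean "spoofed bot rejected" rather than "real bot rejected". Treat a hit as a prompt to check the dashboard, not as proof.

```python
import urllib.error
import urllib.request
from typing import Callable

# Assumption: a generic browser-like string used as the baseline request.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko)"
BLOCK_CODES = (401, 403, 429, 503)  # statuses that commonly signal a block


def fetch_status(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for a GET sent with the given User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # 4xx/5xx responses are raised as exceptions


def edge_block_suspects(
    url: str,
    tokens: list[str],
    fetch: Callable[[str, str], int] = fetch_status,
) -> dict[str, int]:
    """Map each bot token to its status when it differs from the baseline."""
    baseline = fetch(url, BROWSER_UA)
    suspects = {}
    for token in tokens:
        # Assumption: simplified UA; real crawlers send longer strings.
        status = fetch(url, f"Mozilla/5.0 (compatible; {token}/1.0)")
        if status != baseline and status in BLOCK_CODES:
            suspects[token] = status
    return suspects
```

Calling `edge_block_suspects("https://example.com/", ["OAI-SearchBot", "PerplexityBot"])` against your own domain performs live requests; the `fetch` parameter exists so the comparison logic can be exercised without network access.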
How to audit your own setup in 10 minutes
- Read your `robots.txt`. Look for any `Disallow` rule, including wildcard `User-agent: *` rules, that catches `OAI-SearchBot`, `Claude-SearchBot`, `PerplexityBot`, `Perplexity-User`, `ChatGPT-User`, `Claude-User`, or `Google-Extended`.
- Separate your intent. Decide independently: (a) do you want to be cited in AI answers? (almost always yes → allow search bots) and (b) do you want to be used for training? (your call → set training bots accordingly).
- Check your CDN and WAF. Confirm AI search bots are not blocked at the edge, regardless of what `robots.txt` says.
- Verify server logs. Look for hits from `OAI-SearchBot`, `PerplexityBot`, etc. If you never see them, something upstream is blocking them.
- Confirm you still appear in answers. Test the prompts you care about across ChatGPT, Perplexity, and Google AI Overviews.
That last step is where most teams get stuck — checking a handful of prompts by hand doesn't scale, and citation behavior shifts week to week.
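The log-verification step in the audit above is easier to automate than the prompt checks. Here is a minimal sketch that counts hits per search or fetch bot in access-log lines. It assumes a combined-style log format in which the User-Agent is the last quoted field; the sample lines, IP addresses, and User-Agent strings are invented for illustration, so adjust the pattern to your server's actual format.

```python
import re
from collections import Counter

SEARCH_TOKENS = [
    "OAI-SearchBot", "Claude-SearchBot", "PerplexityBot",
    "ChatGPT-User", "Claude-User", "Perplexity-User",
]

# Assumption: the User-Agent is the last double-quoted field on each line.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_bot_hits(log_lines) -> Counter:
    """Count log lines per bot token, matching case-insensitively on the UA."""
    hits = Counter({token: 0 for token in SEARCH_TOKENS})
    for line in log_lines:
        match = UA_FIELD.search(line)
        if not match:
            continue  # line does not end with a quoted field
        user_agent = match.group(1).lower()
        for token in SEARCH_TOKENS:
            if token.lower() in user_agent:
                hits[token] += 1
    return hits


def missing_bots(hits: Counter) -> list[str]:
    """Bots with zero hits: candidates for an upstream block."""
    return [token for token in SEARCH_TOKENS if hits[token] == 0]


# Invented sample lines for illustration only.
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Jun/2026:10:00:00 +0000] "GET /pricing HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '203.0.113.9 - - [01/Jun/2026:10:00:05 +0000] "GET / HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '198.51.100.7 - - [01/Jun/2026:10:00:09 +0000] "GET / HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

sample_hits = count_bot_hits(SAMPLE_LOG)
print("never seen:", missing_bots(sample_hits))
```

A bot that never appears over a reasonable window is not proof of a block, since low-traffic sites are crawled less often, but it is a cheap signal for where to look first.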
How Obsurfable helps
Fixing your robots.txt is a one-time technical change, but knowing whether it worked is an ongoing measurement problem. This is where Obsurfable comes in. You define the Prompts you care about, run retrieval to see how ChatGPT and other engines actually answer them, and check whether you are mentioned or cited. If you were absent because of a crawler block, you'll see your presence recover after you fix it — and Insights will flag other reasons you might still be missing. Rather than guessing whether a config change helped, you watch your citation share respond.
FAQ: robots.txt and AI visibility
If I block GPTBot, will ChatGPT stop citing me?
No. GPTBot is the training crawler. ChatGPT Search citations come through OAI-SearchBot, a separate rule. Block training, keep search, and you stay citable.
Does blocking Google-Extended hurt my Google ranking?
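No. Google-Extended governs whether Google may use your content to train Gemini models and to ground Gemini answers. Google states that it is not a ranking signal and does not affect inclusion in Google Search. Crawling and indexing for Search, which includes AI Overviews, are controlled through Googlebot, which is separate.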
I blocked AI bots a while ago. Is that why I'm not cited?
Very possibly. This is one of the most common reasons a site is absent from AI answers. Audit your robots.txt for any Disallow targeting the search crawlers listed above, and check your CDN settings too.
Can I be cited without allowing any crawler?
No. There is no way to appear in an engine's answers without letting its search crawler read your pages. Allowing the search crawlers is non-negotiable if visibility is the goal.
Should I allow the training bots too?
That is a policy choice. Blocking them is a valid training opt-out and does not affect citations. Allowing them keeps the door open for future model training that includes your content. Either way, keep the search bots allowed.
The bottom line
The instinct to protect your content by blocking AI bots was reasonable when there was only one kind of AI bot. In 2026 there are three, and the search crawlers are the ones that get you cited. If you want to appear in AI answers, the move is simple: allow the search and retrieval crawlers, decide separately about training, and make sure your CDN isn't quietly overriding you. For the bigger picture on earning those citations once the door is open, see our guide to answer engine optimization.