Your robots.txt Might Be Hurting Your Visibility in 2026

If you blocked AI bots in your robots.txt sometime in the last two years, there is a real chance you have quietly removed yourself from ChatGPT, Perplexity, and Google's AI answers. The advice to "block the AI crawlers" made sense when the only thing those bots did was scrape content for training. That is no longer how it works, and the old blanket-block approach is now one of the most common reasons a site never shows up in AI answers.

Here is what changed, and how to check whether your own robots.txt is working against you.

The one distinction that changes everything

In 2026, the major AI vendors run separate crawlers for separate jobs. A single Disallow rule that once blocked "the AI bot" now only blocks one specific function. Broadly, AI crawlers fall into three buckets:

Training crawlers — fetch content to train models. Blocking them is a training opt-out. They do not send you traffic and blocking them does not remove you from AI answers.
Search / retrieval crawlers — build the index that answer engines cite from, and fetch live pages to answer queries. Blocking these removes you from that engine's answers entirely.
User-triggered fetchers — grab a page in real time when a user pastes a link or asks the assistant to look something up.

The critical insight: allowing a training bot is not the same as allowing a search bot, and blocking one does not block the other. OpenAI, Anthropic, Google, Amazon, and Apple have all split these functions apart. The old "block all AI bots" one-liner now actively harms visibility because it sweeps up the search crawlers that get you cited.

The bots that matter, and what each one does

Here are the user-agents worth knowing as of mid-2026, grouped by what blocking them actually does.

Bot	Vendor	Type	What happens if you block it
GPTBot	OpenAI	Training	You opt out of training. ChatGPT can still cite you.
OAI-SearchBot	OpenAI	Search	You disappear from ChatGPT Search citations.
ChatGPT-User	OpenAI	User fetch	ChatGPT can't fetch a page a user asks it to open.
ClaudeBot	Anthropic	Training	Training opt-out only.
Claude-SearchBot	Anthropic	Search	You disappear from Claude's search citations.
Claude-User	Anthropic	User fetch	Claude can't fetch user-requested pages.
Google-Extended	Google	Generative gate	You opt out of Gemini training and grounding. Does not affect Google Search inclusion or blue-link ranking.
Googlebot	Google	Search index	You lose Google Search entirely (don't block this).
PerplexityBot	Perplexity	Search	You disappear from Perplexity citations.
Perplexity-User	Perplexity	User fetch	Perplexity can't fetch user-requested pages.
Bingbot	Microsoft	Search index	You lose Bing, which some AI web search leans on.
CCBot	Common Crawl	Training	You opt out of Common Crawl, which feeds many open-source training sets.

The pattern is clearest with OpenAI: GPTBot is for training, and OAI-SearchBot is what powers ChatGPT Search citations. They belong to the same company but obey separate rules. If you blocked GPTBot to keep your content out of training (a perfectly reasonable choice), you did not remove yourself from ChatGPT answers. But if you blocked OAI-SearchBot, or used a wildcard that caught it, you did.

Anthropic goes even further, running three distinct, independently controllable bots: ClaudeBot (training), Claude-SearchBot (search index), and Claude-User (real-time user fetches).

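Google is its own special case. Google-Extended is a robots.txt token, not an HTTP user-agent: Google's existing crawlers do the fetching, and the token controls whether that content may be used for Gemini training and grounding. Blocking it does not affect your normal Google Search ranking or inclusion at all; that is Googlebot, a completely separate control. Google's documentation treats AI Overviews as a Search feature, so they follow Googlebot and the usual snippet controls rather than Google-Extended. People panic that Google-Extended will tank their rankings. It won't, because it isn't a ranking control.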

What the "block everything" mistake looks like

A robots.txt written in 2024 to keep content out of AI might look like this:

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: *
Disallow: /ai/

At the time, that was a defensible training opt-out. In 2026 it has a side effect the author never intended: PerplexityBot is a search crawler, so this file has silently removed the site from every Perplexity answer. OAI-SearchBot is not named, so it falls under the wildcard group and is blocked only from /ai/. A site that wants ChatGPT visibility should still give OAI-SearchBot its own explicit Allow group, so that a later, broader wildcard rule cannot catch it.
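One way to see the side effect is to feed the file to a robots.txt parser and ask what each crawler may fetch. The sketch below uses Python's standard-library urllib.robotparser. The helper name blocked_bots and the example.com URLs are illustrative assumptions, and Python's matching is only an approximation of how each vendor's crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

# The 2024-style file from above, reproduced verbatim.
ROBOTS_2024 = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: *
Disallow: /ai/
"""

# Search crawlers named in this article.
SEARCH_BOTS = ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"]


def blocked_bots(robots_txt, bots, url="https://example.com/"):
    """Return the bots that may not fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in bots if not parser.can_fetch(bot, url)]


if __name__ == "__main__":
    # Home page: only the explicitly named search crawler is shut out.
    print(blocked_bots(ROBOTS_2024, SEARCH_BOTS))
    # A page under /ai/: the wildcard group catches the unnamed crawlers too.
    print(blocked_bots(ROBOTS_2024, SEARCH_BOTS, "https://example.com/ai/guide"))
```

On the home page only PerplexityBot is reported as blocked. Under /ai/ all three search crawlers are, because the unnamed ones inherit the wildcard group.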

The 2026 setup: block training, keep citations

The defensible default for most brands is to allow the search and retrieval crawlers while making a deliberate, separate decision about the training crawlers. If you want to stay out of training but remain citable, that is completely valid — the citation path stays open:

# Search / answer-engine crawlers — allow to stay citable
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-User
Allow: /

# Google's generative gate: allow for Gemini training and grounding use
User-agent: Google-Extended
Allow: /

# Training crawlers — opt out here if you want to (optional, does not affect citations)
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

Adjust to your own policy. If you are comfortable being used for training, you can allow the training bots too — many sites do, on the theory that being in the training corpus can't hurt. But the non-negotiable rule if visibility is your goal: allow the search crawlers. There is no setting that lets you be cited without being crawled by the search bot.
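If you maintain robots.txt for several sites, it can help to keep the two decisions separate in code as well. The following is a minimal sketch, not a definitive generator: build_robots_txt, its two parameters, and the can_fetch helper are hypothetical names, and the bot lists are simply the ones from the table above. Google-Extended is left out for brevity and can be added as its own group.

```python
from urllib.robotparser import RobotFileParser

# Bot names are taken from the table earlier in this article.
SEARCH_AND_FETCH_BOTS = [
    "OAI-SearchBot", "Claude-SearchBot", "PerplexityBot",
    "ChatGPT-User", "Claude-User",
]
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "CCBot"]


def build_robots_txt(allow_citation=True, allow_training=False):
    """Render one group per bot from two independent policy decisions."""
    groups = []
    for bots, allowed in (
        (SEARCH_AND_FETCH_BOTS, allow_citation),
        (TRAINING_BOTS, allow_training),
    ):
        rule = "Allow: /" if allowed else "Disallow: /"
        groups.extend(f"User-agent: {bot}\n{rule}\n" for bot in bots)
    # Joining with a newline leaves a blank line between groups.
    return "\n".join(groups)


def can_fetch(robots_txt, bot, url="https://example.com/"):
    """Check a generated file with the standard-library parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(bot, url)


if __name__ == "__main__":
    print(build_robots_txt())
```

Calling build_robots_txt() with the defaults produces the "block training, keep citations" policy. Passing allow_training=True flips only the training groups and leaves the search groups untouched.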

The trap almost everyone misses: your CDN

Here is the part that catches out even careful teams. A perfectly configured robots.txt is irrelevant if your CDN or firewall is blocking AI bots at the edge.

Cloudflare and other providers now ship AI-bot-blocking features that return a 403 Forbidden before the request ever reaches your server — which means your robots.txt never even gets read. Some estimates suggest a meaningful share of B2B sites accidentally block high-value AI crawlers this way without realizing it.

So auditing robots.txt is only half the job. You also need to check:

Your CDN dashboard (Cloudflare's "Block AI bots" / bot-management settings, Fastly, Akamai, etc.).
Any WAF rules targeting bot user-agents.
Rate-limiting rules aggressive enough to look like a block to a crawler.
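A rough first probe of the edge is to request the same page twice, once with a browser-like User-Agent and once with a bot-like one, and compare the status codes. The sketch below assumes simplified User-Agent strings (not the vendors' exact headers) and a placeholder URL. Because the request comes from your own IP address rather than the crawler's published ranges, a CDN that verifies bots by IP may treat it differently, so read the result as a hint and confirm it in the dashboard.

```python
import urllib.error
import urllib.request

# Simplified, illustrative User-Agent strings (assumptions, not the vendors'
# exact headers). Real crawlers send longer strings from published IP ranges.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/128.0"
BOT_UA = "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"


def fetch_status(url, user_agent, timeout=10):
    """Return the HTTP status code the server sends for this User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # 4xx/5xx responses are raised as exceptions


def interpret(browser_status, bot_status):
    """Turn a pair of status codes into a rough diagnosis."""
    if browser_status >= 400:
        return "page unavailable to everyone; fix that first"
    if bot_status in (401, 403):
        return "likely blocked at the edge by User-Agent"
    if bot_status == 429:
        return "rate limited; crawlers may read this as a block"
    if bot_status >= 400:
        return f"bot request failed with HTTP {bot_status}"
    return "no User-Agent block detected"


if __name__ == "__main__":
    url = "https://example.com/"  # placeholder; use one of your own pages
    print(interpret(fetch_status(url, BROWSER_UA), fetch_status(url, BOT_UA)))
```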

How to audit your own setup in 10 minutes

Read your robots.txt. Look for any Disallow rule — including wildcard User-agent: * rules — that catches OAI-SearchBot, Claude-SearchBot, PerplexityBot, Perplexity-User, ChatGPT-User, Claude-User, or Google-Extended.
Separate your intent. Decide independently: (a) do you want to be cited in AI answers? (almost always yes → allow search bots) and (b) do you want to be used for training? (your call → set training bots accordingly).
Check your CDN and WAF. Confirm AI search bots are not blocked at the edge, regardless of what robots.txt says.
Verify server logs. Look for hits from OAI-SearchBot, PerplexityBot, etc. If you never see them, something upstream is blocking them.
Confirm you still appear in answers. Test the prompts you care about across ChatGPT, Perplexity, and Google AI Overviews.
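The log check in the audit can be scripted. The sketch below counts access-log lines that mention each crawler token and lists the ones that never appear. The function names and the sample log lines are hypothetical, and the matching is a plain substring test on the raw line, so it trusts the User-Agent field as logged. Zero hits can also mean the crawler simply has not visited yet.

```python
from collections import Counter

# Tokens to look for in the User-Agent field; names come from this article.
BOT_TOKENS = [
    "OAI-SearchBot", "Claude-SearchBot", "PerplexityBot",
    "ChatGPT-User", "Claude-User", "Perplexity-User",
]


def count_bot_hits(log_lines, tokens=BOT_TOKENS):
    """Count log lines mentioning each bot token (case-insensitive)."""
    hits = Counter({token: 0 for token in tokens})
    for line in log_lines:
        lowered = line.lower()
        for token in tokens:
            if token.lower() in lowered:
                hits[token] += 1
    return hits


def missing_bots(hits):
    """Bots with zero hits: candidates for an upstream block."""
    return [token for token, count in hits.items() if count == 0]


if __name__ == "__main__":
    # Hypothetical sample lines in a common access-log shape.
    sample = [
        '203.0.113.5 - - [01/Jun/2026:10:00:00 +0000] "GET / HTTP/1.1" '
        '200 512 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
        '203.0.113.9 - - [01/Jun/2026:10:05:00 +0000] "GET /pricing HTTP/1.1" '
        '200 734 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    ]
    hits = count_bot_hits(sample)
    print(dict(hits))
    print(missing_bots(hits))
```

In practice you would pass an open log file instead of the sample list, for example count_bot_hits(open(path)) with path pointing at your own access log.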

That last step is where most teams get stuck — checking a handful of prompts by hand doesn't scale, and citation behavior shifts week to week.

How Obsurfable helps

Fixing your robots.txt is a one-time technical change, but knowing whether it worked is an ongoing measurement problem. This is where Obsurfable comes in. You define the Prompts you care about, run retrieval to see how ChatGPT and other engines actually answer them, and check whether you are mentioned or cited. If you were absent because of a crawler block, you'll see your presence recover after you fix it — and Insights will flag other reasons you might still be missing. Rather than guessing whether a config change helped, you watch your citation share respond.

FAQ: robots.txt and AI visibility

If I block GPTBot, will ChatGPT stop citing me?

No. GPTBot is the training crawler. ChatGPT Search citations come through OAI-SearchBot, a separate rule. Block training, keep search, and you stay citable.

Does blocking Google-Extended hurt my Google ranking?

No. Google-Extended governs whether your content is used for Gemini training and grounding. Your blue-link ranking is controlled by Googlebot, which is separate. Blocking Google-Extended only removes you from those Gemini uses; Google states that it does not affect inclusion or ranking in Search.

I blocked AI bots a while ago. Is that why I'm not cited?

Very possibly. This is one of the most common reasons a site is absent from AI answers. Audit your robots.txt for any Disallow targeting the search crawlers listed above, and check your CDN settings too.

Can I be cited without allowing any crawler?

No. There is no way to appear in an engine's answers without letting its search crawler read your pages. Allowing the search crawlers is non-negotiable if visibility is the goal.

Should I allow the training bots too?

That is a policy choice. Blocking them is a valid training opt-out and does not affect citations. Allowing them keeps the door open for future model training that includes your content. Either way, keep the search bots allowed.

The bottom line

The instinct to protect your content by blocking AI bots was reasonable when there was only one kind of AI bot. In 2026 there are three, and the search crawlers are the ones that get you cited. If you want to appear in AI answers, the move is simple: allow the search and retrieval crawlers, decide separately about training, and make sure your CDN isn't quietly overriding you. For the bigger picture on earning those citations once the door is open, see our guide to answer engine optimization.