AI Crawler
An AI crawler is a web crawler operated by an AI company to index content for use in AI-generated responses. Blocking AI crawlers in robots.txt is functionally equivalent to telling AI platforms not to cite your content. Allowing them is a prerequisite for GEO visibility. Major AI crawlers include GPTBot and OAI-SearchBot (OpenAI/ChatGPT), ClaudeBot (Anthropic/Claude), PerplexityBot (Perplexity AI), Googlebot (Google AI Overviews/Gemini), Bingbot (Microsoft Copilot), Meta-ExternalAgent (Meta AI), and CCBot (Common Crawl).
Known AI Crawlers and What They Do
- GPTBot / OAI-SearchBot (OpenAI): GPTBot crawls for training data. OAI-SearchBot crawls for real-time retrieval in ChatGPT. Both should be allowed for maximum ChatGPT visibility.
- ClaudeBot (Anthropic): Crawls for Claude’s retrieval system. Strong quality preference in source selection.
- PerplexityBot (Perplexity AI): Real-time crawl indexing with heavy Reddit signal weighting. Always shows source links in responses.
- Googlebot (Google): Powers both traditional search and Google AI Overviews/Gemini. Already allowed on most sites.
- Bingbot (Microsoft): Powers both Bing search and Microsoft Copilot. Also feeds Meta AI’s current web retrieval.
- Meta-ExternalAgent (Meta): Crawls for training Meta’s foundation AI models (Llama) and building Meta’s emerging proprietary search index. Can be aggressive with crawl volume. Separate from facebookexternalhit, which generates link previews.
- CCBot (Common Crawl): Open dataset used by multiple AI companies for training. Blocking CCBot reduces your presence across multiple AI systems simultaneously.
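As an illustration, a robots.txt that explicitly allows the crawlers above might look like the following sketch (Googlebot and Bingbot are omitted because a permissive default already covers them; adjust paths to your site):

```
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Meta-ExternalAgent
Allow: /

User-agent: CCBot
Allow: /
```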
Robots.txt Best Practices
Allow all known AI crawlers unless you have a specific reason to block one. Many AI crawlers do not execute JavaScript, so content must be present in the initial HTML response, which generally means server-side rendering. Monitor your server logs for AI crawler activity by filtering on user-agent strings. Some crawlers (particularly Meta-ExternalAgent) generate high request volumes; these are better handled with rate limiting at the CDN or firewall level than with a full block.
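A minimal sketch of that log monitoring, assuming combined-format access log lines where the user-agent is the final quoted field (the crawler tokens are the names discussed above):

```python
import re

# User-agent tokens for the AI crawlers covered in this entry.
AI_CRAWLERS = [
    "GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot",
    "Googlebot", "bingbot", "meta-externalagent", "CCBot",
]

# One case-insensitive pattern matching any known crawler token.
_PATTERN = re.compile("|".join(re.escape(c) for c in AI_CRAWLERS), re.IGNORECASE)

def ai_crawler_hits(log_lines):
    """Yield (crawler_token, line) for each log line matching a known AI crawler."""
    for line in log_lines:
        m = _PATTERN.search(line)
        if m:
            yield m.group(0), line

# Hypothetical sample lines for illustration.
sample = [
    '1.2.3.4 - - [01/Jan/2025] "GET / HTTP/1.1" 200 "-" "Mozilla/5.0; compatible; GPTBot/1.2"',
    '5.6.7.8 - - [01/Jan/2025] "GET /about HTTP/1.1" 200 "-" "Mozilla/5.0 (regular browser)"',
]
for crawler, line in ai_crawler_hits(sample):
    print(crawler)
```

Tallying the yielded tokens over a day of logs gives a quick per-crawler request count, which is also the simplest way to spot the high-volume behavior noted above before deciding on rate limits.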
For the complete technical optimization framework, see the Generative Engine Optimization guide.
Related: Meta AI · ChatGPT Browse · Perplexity AI · Generative Engine Optimization


