
AI Crawlers Are Reshaping the Web—And DataDome Is Drawing the Line

In the escalating race to power artificial intelligence with the world’s data, large language models (LLMs) are no longer passively reading the internet—they’re actively reshaping it. From ChatGPT to Claude, bots trained and run by the likes of OpenAI, Anthropic, and Meta are crawling public and semi-private corners of the web at unprecedented scale. The problem? Not all bots behave the same, and not all of them are honest about what they’re doing.

At the frontline of this evolving landscape is DataDome, a cybersecurity company watching AI-driven automation evolve from background noise into one of the loudest signals on the modern internet. “In the past 30 days alone, our platform detected 976 million requests from OpenAI-identified crawlers,” DataDome reported. “92% of that traffic was tied to ChatGPT.”

That kind of growth isn’t incidental—it’s tectonic. LLM crawlers now represent 4.5% of all legitimate bot traffic across DataDome’s customer ecosystem, a figure that’s ballooning month over month. And with 36.7% of total traffic originating from non-browser sources like APIs, SDKs, and autonomous agents, the web is quickly becoming a space where humans are no longer the default visitor.

From Search to Simulation: The New Bot Landscape

Unlike traditional crawlers that indexed pages for search engines, today’s LLM bots serve diverse purposes—and they aren’t always easy to detect. DataDome breaks them down into three main categories:

  • Training scrapers vacuum up public content to improve foundation models, often with little regard for robots.txt or polite crawling behavior.

  • Prompt-time fetchers retrieve live data on the fly to enhance responses from AI copilots and assistants.

  • Agentic crawlers act like simulated users—clicking, scrolling, and even filling out forms—essentially becoming automated digital workers.

This diversity in behavior has upended the idea of a “good bot” versus a “bad bot.” A single crawler can, depending on configuration and intent, either enhance user experience or enable large-scale scraping, promo abuse, or account takeovers.
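The robots.txt protocol remains the baseline signal site owners use to opt out of this kind of crawling, even when training scrapers choose to ignore it. A minimal sketch using Python’s standard urllib.robotparser, with a hypothetical policy (GPTBot and ClaudeBot are real user-agent tokens published by OpenAI and Anthropic; the rules themselves are illustrative):

```python
import urllib.robotparser

# Hypothetical robots.txt: disallow known AI training scrapers,
# allow everything else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("GPTBot", "https://example.com/articles/1"))     # → False
print(parser.can_fetch("Googlebot", "https://example.com/articles/1"))  # → True
```

Note that providers often use separate tokens for training and prompt-time fetching, so a complete policy lists each one explicitly. Compliance is also voluntary, which is why detection platforms treat robots.txt as a statement of intent rather than an enforcement mechanism.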

The Arms Race for Attribution

Identifying these bots is no longer as simple as looking at a User-Agent string. While some providers like OpenAI and Google are transparent—publishing IP ranges and user agent identifiers—others operate in murky waters.

“We don’t rely on any single indicator to flag LLM bots,” says DataDome. When possible, the company uses verified methods like IP matching or reverse DNS lookups to classify crawlers. But for less cooperative bots, it's a different story: “If the origin of requests cannot be reliably verified, no detection model is created... This is a strict security measure designed to prevent abuse through spoofed identifiers.”
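The verified methods mentioned above combine into forward-confirmed reverse DNS (FCrDNS): reverse-resolve the requesting IP, check the hostname against the provider’s published domain, then resolve that hostname forward and confirm it maps back to the same IP. A sketch, assuming illustrative domain suffixes rather than any vendor’s authoritative list:

```python
import socket

# Expected reverse-DNS domains per bot. These suffixes are assumptions
# for illustration; real deployments should use each vendor's published list.
TRUSTED_SUFFIXES = {
    "googlebot": (".googlebot.com", ".google.com"),
    "gptbot": (".openai.com",),
}

def hostname_is_trusted(bot: str, hostname: str) -> bool:
    """Check a reverse-DNS hostname against the bot's expected domains."""
    suffixes = TRUSTED_SUFFIXES.get(bot.lower(), ())
    return hostname.endswith(suffixes) if suffixes else False

def verify_crawler(bot: str, ip: str) -> bool:
    """Forward-confirmed reverse DNS: reverse lookup, domain check, forward confirm."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # reverse lookup
        if not hostname_is_trusted(bot, hostname):
            return False
        # Forward-confirm: the hostname must resolve back to the same IP.
        return ip in socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return False  # unresolvable origin => treat as unverified
```

The forward-confirmation step is what defeats spoofing: anyone can publish a reverse-DNS record claiming a trusted hostname, but only the real operator controls the forward records for that zone. Bots that fail this check fall into the “no detection model is created” bucket described above.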

A Spectrum of Risk—and Response

To meet the wide range of use cases and comfort levels among customers, DataDome doesn’t impose a universal policy. Instead, it offers a configurable response model—authorize, block, rate-limit, or challenge—for every LLM crawler identified on its platform.

For example, OpenAI’s GPTBot is hard-blocked by default, but customers can override that decision. Meanwhile, Applebot-Extended is allowed by default, and ClaudeBot from Anthropic is handled on a case-by-case basis. The nuance matters. While some businesses may benefit from visibility in AI-powered product search results, others risk having proprietary or monetized content siphoned without consent or attribution.

The Business of Protecting (and Monetizing) AI Traffic

As AI agents increasingly interact with web apps and APIs, the stakes are rising. During the January 2025 launch of OpenAI’s Operator agent, DataDome observed a 48% spike in crawler traffic within just 48 hours. Such surges strain infrastructure and expose sensitive logic to automated exploitation.
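Absorbing a spike like that is a classic rate-limiting problem. One common approach, shown here as a generic token bucket rather than a description of DataDome’s mechanism, permits a bounded burst and then throttles traffic to a steady refill rate:

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter (illustrative sketch)."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Admit one request if a token is available, else reject."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A per-crawler bucket sized to a crawler’s normal baseline lets a legitimate surge drain gracefully instead of overwhelming origin infrastructure, while sustained over-rate traffic is rejected or challenged.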

But the surge also presents an opportunity. “We now offer new ways to turn AI traffic into value,” DataDome notes. Through a newly launched partner ecosystem, companies can choose how AI agents access content—and monetize those interactions on their own terms.

Conclusion: Defense, By Design

The rise of AI crawlers signals more than just an uptick in traffic—it reflects a fundamental transformation in how machines interact with the web. Where once automation lived in the background, today it’s becoming the front door. That demands a new kind of security posture—one that evaluates not just the label on a request, but its intent and behavior.

“LLM traffic isn’t judged by assumptions, but by how it behaves,” says DataDome. With adaptive AI defenses trained on billions of requests daily, the company is betting on real-time, behavior-based decision-making as the best way to secure digital ecosystems.

Because in the age of AI, it’s not about blocking bots—it’s about controlling what they’re allowed to do.
