AI Bug Hunters Are Learning a Hard Truth: Humans Still Matter
- Dec 15, 2025
- 3 min read
In the race to automate cybersecurity research, large language models are increasingly being treated as tireless junior analysts that can scan endless codebases in search of fatal flaws. But a recent experiment in AI-assisted vulnerability hunting by Kat Traxler, Principal Security Researcher at Vectra, suggests the future of bug discovery is less about replacing humans and more about forcing them into a new role: referee.
The experiment unfolded during Zeroday Cloud, a high-profile hacking competition announced by Wiz in October 2025 alongside Google Cloud, AWS, and Microsoft. The challenge was ambitious. Contestants were tasked with demonstrating unauthenticated remote code execution across a curated list of widely used open-source components that underpin modern cloud services. With only 20 repositories in scope, the target list seemed manageable. In practice, it quickly became overwhelming.
Each project contained millions of lines of code. The real challenge was not finding bugs but deciding where to look.
Rather than manually auditing logic flows and API handlers, the researcher behind the effort turned to static analysis tooling to generate leads. The output was noisy but useful. Hundreds of potential red flags emerged per project, including subprocess calls, dynamic evaluation logic, unsafe deserialization paths, and file operations influenced by user input. The trick was separating theoretical issues from those that could actually be exploited over the network.
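The article doesn't name the static analysis tool used, but the lead-generation step it describes can be sketched with a minimal AST scan that flags the same categories of red flags. The call names and categories below are illustrative assumptions, not the researcher's actual ruleset:

```python
import ast

# Hypothetical mapping of callee names to the risk categories the
# article mentions; a real tool would model data flow, not just names.
RISKY_CALLS = {
    "eval": "dynamic evaluation",
    "exec": "dynamic evaluation",
    "system": "subprocess call",                 # e.g. os.system
    "Popen": "subprocess call",                  # e.g. subprocess.Popen
    "loads": "possible unsafe deserialization",  # e.g. pickle.loads
}

def flag_risky_calls(source: str) -> list[tuple[int, str, str]]:
    """Return (line, callee, category) for each potentially risky call."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            # Handle both `system(x)` and `os.system(x)` call forms.
            name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", "")
            if name in RISKY_CALLS:
                findings.append((node.lineno, name, RISKY_CALLS[name]))
    return findings

sample = """
import os, pickle
def handler(payload):
    os.system(payload["cmd"])
    return pickle.loads(payload["blob"])
"""
for line, callee, category in flag_risky_calls(sample):
    print(f"line {line}: {callee} ({category})")
```

Even this toy version shows why the output is "noisy but useful": it flags every occurrence, with no notion of whether `payload` is actually reachable from the network.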
That is where large language models entered the workflow.
Two models were used in parallel to trace how suspicious inputs moved through the code. One was Google’s Gemini 2.5. The other was Claude Sonnet 4.5. Both were prompted to perform taint analysis by following potentially dangerous values backward to their sources to determine whether user control was possible.
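The parallel-prompting workflow might look something like the sketch below. The prompt wording and the client interface are assumptions for illustration; the article does not publish the actual prompts, so the model calls are stubbed as plain callables:

```python
# Hypothetical sketch: fan the same taint-analysis question out to
# multiple model clients and collect their answers side by side.
TAINT_PROMPT = """You are performing taint analysis on the snippet below.
Starting from the sink `{sink}` on line {line}, trace each argument
backward to its sources. For every source, state whether a remote,
unauthenticated user could control the value, and justify your answer.

---
{snippet}
---"""

def build_taint_prompt(snippet: str, sink: str, line: int) -> str:
    return TAINT_PROMPT.format(snippet=snippet, sink=sink, line=line)

def ask_both(snippet: str, sink: str, line: int, clients: dict) -> dict:
    """Send the identical question to each model client (name -> callable)."""
    prompt = build_taint_prompt(snippet, sink, line)
    return {name: client(prompt) for name, client in clients.items()}
```

Keeping the prompt identical across models is what makes the disagreement informative: any divergence in the answers reflects the models' reasoning, not differences in the question.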
The contrast in behavior was immediate.
Gemini behaved like a cautious senior architect. Its analyses were methodical, technically sound, and conservative to a fault. When execution paths relied on configuration files or indirect inputs, Gemini frequently concluded that exploitation was unlikely and discouraged further investigation.
Claude took the opposite approach. It was imaginative, energetic, and eager to explore edge cases. Even when a clear exploit path was absent, Claude tended to suggest alternative angles, adjacent weaknesses, or misconfiguration risks that might turn theoretical issues into practical ones.
Neither approach was sufficient on its own.
Gemini’s skepticism risked missing novel exploitation paths that fell outside traditional assumptions. Claude’s optimism created false positives that could burn weeks of research time. The breakthrough came from treating the models not as authorities, but as adversarial perspectives.
The researcher effectively became a living coordination layer, weighing the arguments of each model and deciding which hypotheses deserved deeper exploration. In AI research circles, this approach resembles what is known as a blackboard architecture, a system where multiple specialized agents contribute ideas to a shared workspace while a central decision maker selects the most promising directions.
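The blackboard pattern described above can be reduced to a small sketch. The names and sample hypotheses are invented, and the human referee is modeled as a selection callback rather than anything the researcher actually built:

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    agent: str
    claim: str
    confidence: float  # the agent's own (often unreliable) self-assessment

@dataclass
class Blackboard:
    """Shared workspace where specialized agents post candidate leads."""
    hypotheses: list = field(default_factory=list)

    def post(self, agent: str, claim: str, confidence: float) -> None:
        self.hypotheses.append(Hypothesis(agent, claim, confidence))

    def triage(self, judge) -> list:
        """The `judge` callable (here, the human) picks what to pursue."""
        return [h for h in self.hypotheses if judge(h)]

board = Blackboard()
board.post("gemini", "config-file path, likely not attacker-controlled", 0.2)
board.post("claude", "deserialization reachable via upload endpoint", 0.9)
# A human referee might pursue Claude's lead while also re-checking
# Gemini's low-confidence dismissal rather than trusting it outright.
picked = board.triage(lambda h: "reachable" in h.claim or h.confidence < 0.3)
```

The design point is that neither agent's confidence score decides anything; the judge applies criteria the agents don't share, which is exactly the role the human played in this experiment.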
In this case, the decision maker was human.
After weeks of triage, no unauthenticated remote code execution vulnerabilities were uncovered that met the competition’s strict criteria. By the rules of Zeroday Cloud, that outcome was a failure. But the process still surfaced several previously unidentified security issues that fell outside the contest’s narrow scope.
That result highlights a growing tension in AI-driven security research. Many teams are optimizing for speed and severity, chasing CVSS 10 vulnerabilities that can be reported quickly and scored easily. But real systems are messier. Valuable discoveries often sit in gray areas where impact is contextual and exploitation paths are non-obvious.
AI excels at pattern matching and exhaustive analysis. What it still struggles with is judgment.
As defenders and researchers increasingly deploy AI agents to hunt for flaws, the role of human intuition may become more important, not less. The models can generate possibilities at scale, but someone still has to decide which ideas are worth believing.
For now, the most effective bug hunter is not an AI agent working alone. It is a human who knows when to doubt one model’s certainty and when to rein in another’s enthusiasm.
According to Traxler, more details on the vulnerabilities uncovered during this experiment are expected to be shared in the coming months, offering a deeper look at how multi-model workflows may reshape offensive security research in the near future.


