Wiz Research | Technical Report
The AI Offensive Security Benchmark: 25 Agents. 257 Real Attacks. Who Wins?
Key Takeaways
- The agent matters as much as the model: the same model can swing by more than 40 percentage points depending on which agent runs it, and no single agent-model combination dominates across all five categories.
- Native-provider pairings have a measurable edge: agents perform notably better with models from their own provider. Gemini CLI averages 39.9% with Gemini models vs. 18.8% without, and Claude Code averages 37.5% with Claude models vs. 28.5% without.
- Most offensive categories remain largely unsolved: API security is the most accessible category for AI agents today, while autonomous zero-day discovery and cloud exploitation are still out of reach; 13 of 25 combinations score 0% on zero-day tasks.
- Architecture shapes performance independently of intelligence: different agent tooling, execution strategies, and system prompts shift results even when the underlying model stays the same, meaning how you build the agent is as consequential as which model you choose.
Who'll be interested in this report?
- Security Engineers & Red Teamers looking to choose the right AI agent-model pairing for vulnerability discovery, penetration testing, and exploit development workflows.
- Security Leaders & CISOs who need to assess the real-world offensive capabilities of AI agents to inform risk posture, tool investments, and responsible deployment decisions.
- AI & ML Engineers building or evaluating agentic security tools who want to understand how agent architecture and model selection jointly determine performance on complex, multi-step tasks.
- Security Researchers benchmarking frontier models and seeking a reproducible, deterministic evaluation framework that spans the full offensive lifecycle.
What's included?
- Full benchmark results across 25 agent-model combinations: 4 agents (Claude Code, Gemini CLI, OpenCode, Codex) × 8 models (Claude Opus 4.6, Opus 4.5, Sonnet 4.5, Haiku 4.5, GPT-5.2, Gemini 3 Pro, Gemini 3 Flash, Grok 4), with scores and runtimes for every pairing.
- 257 real-world challenges across five offensive security categories: Zero-Day Discovery, CVE Detection (176 challenges across Python, Go, and Java), API Security, Web Security (PHP CTF-style exploits), and Cloud Security (AWS, Azure, GCP, Kubernetes).
- Detailed methodology and anti-cheating measures: deterministic scoring (no LLM-as-a-judge), pass@3 evaluation, network-isolated containers, dynamic validation, and session-specific flags; a sketch of this scoring scheme follows this list.
- Analysis of agent vs. model effects: how native-provider advantage, agent architecture, and domain-specific tooling independently shape offensive performance.
- Category-by-category breakdowns: scoring methods, vulnerability types covered, and comparative charts showing where each combination leads or falls short.
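To make the anti-cheating design above concrete, here is a minimal sketch of deterministic pass@3 scoring with session-specific flags. The function names and flag format are illustrative assumptions, not the harness Wiz actually runs; the point is that success is decided by exact string comparison against a per-session value, so no LLM judgment is involved and flags memorized from training data or leaked from earlier runs never match.

```python
import hashlib

def make_session_flag(session_id: str, secret: str) -> str:
    """Derive a flag unique to this benchmark session (hypothetical format),
    so an agent can only score by actually completing the attack."""
    digest = hashlib.sha256(f"{secret}:{session_id}".encode()).hexdigest()[:32]
    return f"FLAG{{{digest}}}"

def solved(submission: str, expected_flag: str) -> bool:
    """Deterministic check: exact string match, no LLM-as-a-judge."""
    return submission.strip() == expected_flag

def pass_at_3(attempts: list[str], expected_flag: str) -> bool:
    """pass@3: a challenge counts as solved if any of up to three
    independent attempts recovers the expected flag."""
    return any(solved(a, expected_flag) for a in attempts[:3])

if __name__ == "__main__":
    flag = make_session_flag("run-042", "per-deployment-secret")
    print(pass_at_3(["wrong guess", flag], flag))  # True: solved on attempt 2
```

Session-specific flags pair naturally with the network-isolated containers mentioned above: because each container only ever sees its own derived flag, an agent cannot shortcut a challenge by searching the internet or replaying a cached answer.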