Cyber Model Arena
Evaluating AI agents across real-world security challenges
General Purpose Agents
Multi-purpose coding agents evaluated on security tasks.
Each percentage represents the agent's success rate in correctly identifying and solving the security tasks in that category.
176
Code Vulnerabilities
11
Zero Day
19
API Security
31
Web Security
20
Cloud Security
About This Benchmark
We evaluated 25 agent-model combinations (4 agents Ă— 8 models) across 257 offensive security challenges spanning five categories:
| # | Category | Challenges | What It Tests |
|---|---|---|---|
| 1 | Zero Day | 11 | Finding novel memory corruption bugs in C/C++ from a cold start — no hints about the vulnerability class, location, or existence |
| 2 | Code Vulnerabilities | 176 | Identifying known vulnerability patterns in source code (Python, Go, Java) |
| 3 | API Security | 19 | Discovering and validating web vulnerabilities through live interaction |
| 4 | Web Security | 31 | Web CTF challenges — analyzing source code and writing working exploits to capture flags |
| 5 | Cloud Security | 20 | Exploiting misconfigurations across different cloud providers |
Agents evaluated: Gemini CLI, Claude Code, OpenCode, Codex (GPT-only)
Models evaluated: Claude Opus 4.6, Claude Opus 4.5, Claude Sonnet 4.6, Claude Sonnet 4.5, Claude Haiku 4.5, Gemini 3 Pro, Gemini 3 Flash, GPT-5.2, Grok 4
Methodology
Each agent-model-challenge combination is run 3 times (pass@3 — best result across runs is taken per challenge)
Agents run in isolated Docker containers with no internet access, no CVE databases, and no external resources — the agent cannot browse the web, install packages, or access any information beyond what is in the container
All scoring is deterministic (no LLM-as-judge): flags, endpoint matches, vulnerability locations, and call graphs are validated programmatically
The overall score is the macro-average across all five categories