Cyber Model Arena

Evaluating AI agents across real-world security challenges

General Purpose Agents

Multi-purpose coding agents evaluated on security tasks.

Each percentage represents the agent's success rate in correctly identifying and solving the security tasks in that category.

# | Agent | Model | Code Vulns | API Security | Web Security | Cloud Security | Overall | Avg Time
--- | --- | --- | --- | --- | --- | --- | --- | ---
1 | Claude Code | Claude Opus 4.6 | 49.4% | 84.2% | 41.9% | 35% | 47.6% | 8.2 min
2 | Gemini CLI | Gemini 3.1 Pro | 42.9% | 78.9% | 41.9% | 35% | 47% | 7.3 min
3 | Claude Code | Gemini 3.1 Pro | 35.2% | 84.2% | 41.9% | 35% | 44.7% | 8.9 min
4 | Claude Code | Claude Opus 4.7 | 43.8% | 74% | 51.6% | 30% | 43.8% | 9.2 min
5 | Gemini CLI | Gemini 3 Pro | 28.8% | 73.7% | 38.7% | 40% | 41.7% | 6.9 min
6 | Claude Code | Claude Opus 4.5 | 42.9% | 78.9% | 35.5% | 30% | 41.1% | 5.5 min
7 | Claude Code | Gemini 3 Pro | 35.2% | 84.2% | 35.5% | 30% | 40.6% | 8.8 min
8 | Claude Code | Claude Opus 4.8 | 39.2% | 90% | 51.6% | 30% | 39.2% | 9.1 min
9 | Claude Code | Claude Sonnet 4.6 | 42.9% | 78.9% | 38.7% | 25% | 38.9% | 5.6 min
10 | Claude Code | Gemini 3.5 Flash | 38.1% | 42% | 51.6% | 20% | 38.1% | 6.1 min
11 | Gemini CLI | Gemini 3 Flash | 27.5% | 78.9% | 35.5% | 30% | 38% | 6.1 min
12 | OpenCode | Claude Opus 4.6 | 15.1% | 78.9% | 41.9% | 30% | 36.8% | 4.9 min
13 | Claude Code | Gemini 3 Flash | 32.5% | 73.7% | 41.9% | 20% | 35.4% | 5.1 min
14 | OpenCode | Claude Opus 4.5 | 13.9% | 73.7% | 38.7% | 25% | 33.9% | 4.5 min
15 | Claude Code | Claude Sonnet 4.5 | 46.6% | 68.4% | 25.8% | 20% | 32.2% | 6.2 min
16 | Gemini CLI | Gemini 3.5 Flash | 29.8% | 42% | 6.5% | 30% | 29.8% | 6.4 min
17 | OpenCode | Claude Sonnet 4.6 | 14% | 73.7% | 35.5% | 15% | 29.5% | 4.2 min
18 | Claude Code | Claude Haiku 4.5 | 39.2% | 72.4% | 19.4% | 15% | 29.2% | 4.7 min
19 | Gemini CLI | Claude Opus 4.6 | 12.3% | 36.8% | 38.7% | 25% | 26.2% | 3.7 min
20 | Gemini CLI | Claude Sonnet 4.6 | 6% | 57.9% | 32.3% | 20% | 25.1% | 3.2 min
21 | Gemini CLI | Grok 4 | 17.2% | 76.3% | 19.4% | 10% | 24.6% | 6.4 min
22 | Codex | GPT-5.2 | 36.6% | 55.3% | 19.4% | 10% | 24.3% | 6.2 min
23 | Gemini CLI | Claude Opus 4.5 | 8.7% | 27.6% | 38.7% | 25% | 23.6% | 3.5 min
24 | OpenCode | Claude Sonnet 4.5 | 12% | 68.4% | 22.6% | 10% | 22.6% | 4.4 min
25 | Claude Code | Grok 4 | 35% | 36.8% | 16.1% | 15% | 20.6% | 8 min
26 | OpenCode | Claude Haiku 4.5 | 8.7% | 68.4% | 9.7% | 10% | 19.4% | 4.2 min
27 | Gemini CLI | Claude Sonnet 4.5 | 0.4% | 51.3% | 19.4% | 15% | 19% | 3.4 min
28 | Claude Code | GPT-5.2 | 9.3% | 67.1% | 6.5% | 5% | 17.6% | 2.4 min
29 | OpenCode | Gemini 3 Pro | 12.2% | 38.2% | 6.5% | 15% | 16.2% | 3.3 min
30 | OpenCode | Gemini 3.1 Pro | 13.9% | 15.8% | 9.7% | 20% | 15.5% | 3.5 min
31 | Gemini CLI | Claude Haiku 4.5 | 3.5% | 36.8% | 16.1% | 5% | 12.3% | 2.6 min
32 | OpenCode | GPT-5.2 | 23.9% | 28.9% | 3.2% | 5% | 12.2% | 4.6 min
33 | OpenCode | Grok 4 | 17% | 10.5% | 12.9% | 15% | 11.1% | 4.7 min
34 | OpenCode | Gemini 3 Flash | 10.5% | 25% | 3.2% | 10% | 9.7% | 2.8 min
35 | Gemini CLI | GPT-5.2 | 1.3% | 31.6% | 3.2% | 0% | 7.2% | 2.6 min

Challenges per category:

Code Vulnerabilities: 176
API Security: 19
Web Security: 31
Cloud Security: 20

Technical Report

About This Benchmark

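We evaluated 35 agent-model combinations (each pairs one of 4 agents with one model; not every agent was run with every model) across 257 offensive security challenges spanning five categories. The four categories reported on this leaderboard account for 246 of those challenges: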

# | Category | Challenges | What It Tests
--- | --- | --- | ---
1 | Code Vulnerabilities | 176 | Identifying known vulnerability patterns in source code (Python, Go, Java)
2 | API Security | 19 | Discovering and validating web vulnerabilities through live interaction
3 | Web Security | 31 | Web CTF challenges: analyzing source code and writing working exploits to capture flags
4 | Cloud Security | 20 | Exploiting misconfigurations across different cloud providers

Agents evaluated: Gemini CLI, Claude Code, OpenCode, Codex (GPT-only)

Models evaluated: Claude Opus 4.8, Claude Opus 4.7, Claude Opus 4.6, Claude Opus 4.5, Claude Sonnet 4.6, Claude Sonnet 4.5, Claude Haiku 4.5, Gemini 3.1 Pro, Gemini 3 Pro, Gemini 3.5 Flash, Gemini 3 Flash, GPT-5.2, Grok 4

Methodology

  • Each agent-model-challenge combination is run 3 times (pass@3 — best result across runs is taken per challenge)

  • Agents run in isolated Docker containers with no internet access, no CVE databases, and no external resources — the agent cannot browse the web, install packages, or access any information beyond what is in the container

  • All scoring is deterministic (no LLM-as-judge): flags, endpoint matches, vulnerability locations, and call graphs are validated programmatically

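  • The overall score is the macro-average across all five categories: the unweighted mean of the per-category scores, so each category counts equally regardless of how many challenges it contains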
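
The scoring rules above (best of 3 runs per challenge, then a per-category success rate, then a macro-average) can be illustrated with a minimal sketch. It is not the benchmark's actual scoring code. The function names, the `results` layout, and the assumption that each run yields a score between 0 and 1 are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean


def best_of_runs(run_scores):
    """pass@k as described above: keep the best result across the runs.

    Assumption: each run is scored in [0, 1] (1 = solved).
    """
    return max(run_scores)


def category_rates(results):
    """Map {(category, challenge_id): [run scores]} to {category: percent}.

    Each challenge contributes its best run; the category rate is the
    mean of those best scores, expressed as a percentage.
    """
    per_category = defaultdict(list)
    for (category, _challenge_id), run_scores in results.items():
        per_category[category].append(best_of_runs(run_scores))
    return {cat: 100.0 * sum(best) / len(best) for cat, best in per_category.items()}


def macro_average(rates):
    """Unweighted mean of per-category rates: every category counts equally."""
    return mean(rates.values())


# Hypothetical data: two categories, three runs per challenge.
results = {
    ("web", "challenge-1"): [0, 1, 0],   # solved on the second run
    ("web", "challenge-2"): [0, 0, 0],   # never solved
    ("api", "challenge-1"): [1, 1, 1],   # solved on every run
}

rates = category_rates(results)
overall = macro_average(rates)
print(rates)    # {'web': 50.0, 'api': 100.0}
print(overall)  # 75.0
```

In this toy data the macro-average is 75.0, whereas pooling all three challenges together would give 2 of 3, about 66.7. The difference matters for this benchmark because the categories vary widely in size (176 challenges in Code Vulnerabilities versus 19 in API Security), so a pooled average would be dominated by the largest category.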