Introducing AI Cyber Model Arena: A Real-World Benchmark for AI Agents in Cybersecurity

Wiz Research’s AI Cyber Model Arena benchmarks offensive AI security on 257 real-world challenges (zero-days, CVEs, API/web, and cloud across AWS/Azure/GCP/K8s), demonstrating what AI models and agents can really do

We are excited to launch the AI Cyber Model Arena. This work introduces a benchmark suite of 257 real-world challenges spanning five offensive domains: zero-day discovery, CVE detection (finding known vulnerabilities in code), API security, web security, and cloud security.

AI agents are rapidly becoming part of everyday security workflows, driven by the significant leap in LLM cybersecurity capabilities.

At Wiz Research, we continuously evaluate the cybersecurity capabilities of AI models to support vulnerability research, threat hunting, and in-product research efforts. We decided to build an evaluation benchmark based on the real-world cybersecurity challenges we face, and to share the results with the community.

Our goal is broad coverage across the offensive lifecycle: from cold-start memory-bug discovery, to static analysis of known vulnerability patterns, to dynamic exploitation in web/API settings, to multi-step cloud misconfiguration attacks across AWS, Azure, GCP, and Kubernetes. All of it is grounded in real exposures and vulnerabilities encountered in the day-to-day work of Wiz Research.

The methodology

The evaluation setup explicitly separates agent effects from model effects. We run a multi-agent × multi-model matrix, executing each combination across all five categories.
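
To illustrate the matrix structure, here is a minimal sketch of an orchestrator loop; the agent names, model names, challenge lists, and the run_challenge stub are all hypothetical placeholders, not the actual Wiz harness:

```python
import itertools
import random

# Hypothetical agent scaffolds, models, and challenge lists; the real matrix
# and harness are Wiz-internal, so this only illustrates the structure.
AGENTS = ["agent_a", "agent_b"]
MODELS = ["model_x", "model_y"]
CATEGORIES = ["zero_day", "cve", "api", "web", "cloud"]
CHALLENGES = {cat: [f"{cat}_{i}" for i in range(5)] for cat in CATEGORIES}

def run_challenge(agent: str, model: str, challenge_id: str) -> bool:
    # Stand-in for launching the agent+model pair on one challenge;
    # returns pass/fail. Here it is randomized purely for demonstration.
    return random.random() < 0.5

def run_matrix() -> dict:
    """Execute every agent x model combination across all five categories."""
    results = {}
    for agent, model, cat in itertools.product(AGENTS, MODELS, CATEGORIES):
        passed = [run_challenge(agent, model, c) for c in CHALLENGES[cat]]
        results[(agent, model, cat)] = sum(passed) / len(passed)
    return results

if __name__ == "__main__":
    for (agent, model, cat), rate in sorted(run_matrix().items()):
        print(f"{agent} + {model} on {cat}: {rate:.0%}")
```

Because every model runs under every scaffold, a score difference between two cells in the same row can be attributed to the agent rather than the model, and vice versa.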

Scoring is deterministic and programmatic, using category-specific ground truth (see the sketch after this list):

  • multi-dimensional rubrics for zero-day and CVE detection

  • endpoint-and-severity matching for API security

  • flag capture for web and cloud challenges
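
To make "deterministic and programmatic" concrete, here is a minimal sketch of per-category scorers; the rubric fields, the (endpoint, severity) tuple shape, and the flag comparison are our illustrative assumptions, not the benchmark's actual ground-truth schema:

```python
import hmac

def score_rubric(report: dict, rubric: dict) -> float:
    # Multi-dimensional rubric (zero-day / CVE detection): fraction of
    # required dimensions (e.g., root cause, location, impact) satisfied.
    hits = sum(1 for dim, expected in rubric.items() if report.get(dim) == expected)
    return hits / len(rubric)

def score_api(findings: set, ground_truth: set) -> float:
    # API security: each element is an (endpoint, severity) tuple, so a
    # finding counts only if both the endpoint and the severity match.
    return len(findings & ground_truth) / len(ground_truth)

def score_flag(submitted: str, expected: str) -> float:
    # Web/cloud: exact flag capture; constant-time compare as a precaution.
    return 1.0 if hmac.compare_digest(submitted.strip(), expected) else 0.0
```

The common property is that none of these scorers involve an LLM judge: given the same agent output and the same ground truth, the score is always the same.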

Each challenge is attempted three times and reported as pass@3 (best-of-three), reflecting how practitioners often retry tools and act on the best outcome rather than a single run.
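
Concretely, best-of-three means a challenge counts as solved if any of its three independent attempts succeeds. A minimal sketch (the attempt records below are illustrative, not benchmark data):

```python
def pass_at_3(attempts: list) -> bool:
    # Best-of-three: solved if any of the (up to) three attempts passed.
    return any(attempts[:3])

# Illustrative attempt records for three challenges
runs = {
    "chal_1": [False, True, False],
    "chal_2": [False, False, False],
    "chal_3": [True, True, True],
}
solved = sum(pass_at_3(a) for a in runs.values())
print(f"pass@3: {solved}/{len(runs)} = {solved / len(runs):.0%}")  # pass@3: 2/3 = 67%
```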

The benchmark runs inside isolated Docker containers with sufficient resources and no per-challenge timeouts, so scores reflect capability rather than throttling. Each agent uses its native tools and execution model out of the box (no MCP servers or custom augmentations), while the container provides domain-appropriate system tooling (e.g., debuggers for binary work, cloud CLIs for cloud tasks) equally to all agents. This two-layer design aims to be fair and realistic. To prevent cheating and ensure fair results, all challenges run in network-isolated containers with dynamic validation to catch hardcoded solutions and session-specific artifacts (like flags) where applicable.
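
As a rough sketch of this kind of isolation, the snippet below launches a single attempt in a network-isolated container via the docker Python SDK; the image, command, and resource limits are placeholder assumptions, and the real harness (including its dynamic validation of flags and artifacts) is not published:

```python
import docker  # docker-py: pip install docker

def run_isolated(image: str, command: str) -> str:
    """Run one challenge attempt in an isolated container and return its logs."""
    client = docker.from_env()
    output = client.containers.run(
        image,
        command,
        network_mode="none",  # no network egress from the attempt container
        mem_limit="8g",       # generous resources so scores reflect capability
        remove=True,          # clean up the container after the run
    )
    return output.decode()

# Placeholder image and command; a real challenge image would ship its own
# domain-appropriate tooling (debuggers, cloud CLIs, etc.).
print(run_isolated("python:3.12-slim", 'python -c "print(42)"'))
```

Note there are no timeout or kill parameters here, mirroring the design choice that runs are resource-sufficient and untimed so results measure capability, not throttling.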

The main takeaway

One central takeaway from the results is that offensive capability is jointly determined by the model and the agent scaffold: the same model can swing dramatically depending on the scaffold, and performance is highly domain-specific. No single pairing dominates across all categories, even when one combination leads in most of them.

We will continue updating the AI Cyber Model Arena with newly released models, additional real-world challenges, and new tools and frameworks that help us explore the frontiers of AI cybersecurity capabilities.
