Introducing AI Cyber Model Arena: A Real-World Benchmark for AI Agents in Cybersecurity

Wiz Research’s AI Cyber Model Arena benchmarks offensive AI security on 257 real-world challenges (zero-days, CVEs, API/web, and cloud across AWS/Azure/GCP/K8s), demonstrating what AI models and agents can really do.

We are excited to launch the AI Cyber Model Arena. This work introduces a benchmark suite of 257 real-world challenges spanning five offensive domains: zero-day discovery, CVE (code vulnerability) detection, API security, web security, and cloud security.

AI agents are rapidly becoming part of everyday security workflows, driven by the significant leap in LLM cybersecurity capabilities.

At Wiz Research, we continuously evaluate the cybersecurity capabilities of AI models to support vulnerability research, threat hunting, and in-product research efforts. We decided to build an evaluation benchmark based on the real-world cybersecurity challenges we face, and to share the results with the community.

Our goal is broad coverage across the offensive lifecycle: from cold-start memory bug discovery, to static analysis of known vulnerability patterns, to dynamic exploitation in web/API settings, to multi-step cloud misconfiguration attacks across AWS, Azure, GCP, and Kubernetes, all grounded in real exposures and vulnerabilities encountered in the day-to-day work of Wiz Research.

The methodology

The evaluation setup explicitly separates agent effects from model effects. We run a multi-agent × multi-model matrix, executing each combination across all five categories.
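A multi-agent × multi-model matrix of this kind can be sketched as a simple cross product. The agent, model, and category names below are illustrative placeholders, not the arena's actual roster:

```python
# Hypothetical sketch of a multi-agent x multi-model evaluation matrix.
# Agent and model names are illustrative assumptions, not the real lineup.
from itertools import product

AGENTS = ["agent_a", "agent_b"]
MODELS = ["model_x", "model_y"]
CATEGORIES = ["zero_day", "cve", "api", "web", "cloud"]

def run_matrix(evaluate):
    """Run every (agent, model) pairing on every category and collect scores.

    `evaluate` is a callback that executes one combination and returns a score,
    so agent effects and model effects can be separated in the results.
    """
    return {
        (agent, model, category): evaluate(agent, model, category)
        for agent, model, category in product(AGENTS, MODELS, CATEGORIES)
    }
```

Because every model runs under every agent scaffold, differences along the agent axis (holding the model fixed) isolate scaffold effects, and vice versa.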

Scoring is deterministic and programmatic using category-specific ground truth:

  • multi-dimensional rubrics for zero-day and CVE detection

  • endpoint-and-severity matching for API security

  • flag capture for web and cloud challenges.
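As a minimal sketch of what deterministic, category-specific scoring can look like (field names and weights here are illustrative assumptions, not the arena's actual schema):

```python
# Hypothetical sketches of the three scoring styles described above.
# All field names, weights, and formats are illustrative assumptions.

def score_rubric(findings, rubric):
    """Multi-dimensional rubric: award each criterion's weight if satisfied."""
    earned = sum(weight for criterion, weight in rubric.items() if criterion in findings)
    return earned / sum(rubric.values())

def score_api(reported, ground_truth):
    """Endpoint-and-severity matching: a hit requires both to agree."""
    hits = sum(1 for endpoint, severity in reported
               if ground_truth.get(endpoint) == severity)
    return hits / len(ground_truth)

def score_flag(submitted, expected):
    """Flag capture: exact match, scored pass/fail."""
    return 1.0 if submitted.strip() == expected else 0.0
```

Because every scorer compares against fixed ground truth, the same transcript always produces the same score, with no LLM-as-judge variance.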

Each challenge is attempted three times and reported as pass@3 (best-of-three), reflecting how practitioners often retry tools and act on the best outcome rather than a single run.

The benchmark runs inside isolated Docker containers with sufficient resources and no per-challenge timeouts, so scores reflect capability rather than throttling. Each agent uses its native tools and execution model out of the box (no MCP servers or custom augmentations), while the container provides domain-appropriate system tooling (e.g., debuggers for binary work, cloud CLIs for cloud tasks) equally to all agents. This two-layer design aims to be fair and realistic. To prevent cheating and ensure fair results, all challenges run in network-isolated containers with dynamic validation to catch hardcoded solutions and session-specific artifacts (like flags) where applicable.
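One way to implement session-specific flags of the kind mentioned above is to derive each flag from a per-run session identifier, so a flag hardcoded from a previous run fails validation. This is a hedged sketch of that idea; the HMAC-based scheme is our illustrative assumption, not the arena's actual mechanism:

```python
# Hypothetical sketch of dynamic, session-specific flag validation to catch
# hardcoded solutions. The HMAC derivation scheme is an illustrative assumption.
import hashlib
import hmac

def mint_flag(session_id, secret):
    """Derive a flag unique to this container session."""
    digest = hmac.new(secret, session_id.encode(), hashlib.sha256).hexdigest()[:16]
    return f"FLAG{{{digest}}}"

def validate_flag(submitted, session_id, secret):
    """A flag copied from another session (or hardcoded) will not verify."""
    return hmac.compare_digest(submitted, mint_flag(session_id, secret))
```

Since the flag only exists inside the network-isolated container for that run, a correct submission implies the agent actually reached it during the session.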

The main takeaway

One central takeaway from the results is that offensive capability is jointly determined: the same model can swing dramatically depending on the agent scaffold, and performance is highly domain-specific. No single pairing dominates across all categories, even when one combination leads in most of them.

We will continue updating the AI Cyber Model Arena with newly released models, additional real-world challenges, and new tools and frameworks that help us explore the frontiers of AI cybersecurity capabilities.
