Introducing AI Cyber Model Arena: A Real-World Benchmark for AI Agents in Cybersecurity

Wiz Research’s AI Cyber Model Arena benchmarks offensive AI security on 257 real-world challenges (zero-days, CVEs, API/web, and cloud across AWS/Azure/GCP/K8s), demonstrating what AI models and agents can really do.

We are excited to launch the AI Cyber Model Arena. This work introduces a benchmark suite of 257 real-world challenges spanning five offensive domains: zero-day discovery, CVE detection (finding known vulnerabilities in code), API security, web security, and cloud security.

AI agents are rapidly becoming part of everyday security workflows, driven by the significant leap in LLM cybersecurity capabilities.

At Wiz Research, we continuously evaluate the cybersecurity capabilities of AI models to support vulnerability research, threat hunting, and in-product research efforts. We decided to build an evaluation benchmark based on the real-world cybersecurity challenges we face, and to share the results with the community.

Our goal is broad coverage across the offensive lifecycle - from cold-start memory bug discovery, to static analysis of known vulnerability patterns, to dynamic exploitation in web/API settings, to multi-step cloud misconfiguration attacks across AWS, Azure, GCP, and Kubernetes, all grounded in real exposure and vulnerabilities encountered in the day-to-day work of Wiz Research.

The methodology

The evaluation setup explicitly separates agent effects from model effects. We run a multi-agent × multi-model matrix, executing each combination across all five categories.
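Such a matrix can be sketched as a simple cross product; the agent, model, and run-function names below are hypothetical placeholders, not the actual benchmark lineup:

```python
from itertools import product

# Hypothetical agents and models; the real benchmark uses its own lineup
# across the five challenge categories.
AGENTS = ["agent_a", "agent_b"]
MODELS = ["model_x", "model_y"]
CATEGORIES = ["zero-day", "cve", "api", "web", "cloud"]

def run_matrix(run_fn):
    """Execute every agent x model combination across all categories,
    so agent effects can be separated from model effects."""
    results = {}
    for agent, model, category in product(AGENTS, MODELS, CATEGORIES):
        results[(agent, model, category)] = run_fn(agent, model, category)
    return results
```

Keeping agents and models as independent axes is what lets the analysis attribute a score swing to the scaffold rather than the underlying model.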

Scoring is deterministic and programmatic using category-specific ground truth:

  • multi-dimensional rubrics for zero-day and CVE detection

  • endpoint-and-severity matching for API security

  • flag capture for web and cloud challenges.
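As a rough illustration of deterministic scoring, two of these checks can be reduced to exact matching against ground truth. The record shapes below are simplified assumptions; the benchmark's actual rubrics are richer:

```python
def score_api_findings(reported, ground_truth):
    """Endpoint-and-severity matching: a reported API finding counts only
    if both its endpoint and its severity match the ground truth exactly."""
    truth = {(g["endpoint"], g["severity"]) for g in ground_truth}
    hits = {(r["endpoint"], r["severity"]) for r in reported} & truth
    return len(hits) / len(truth) if truth else 0.0

def score_flag(submitted, expected):
    """Flag capture for web/cloud challenges: binary pass/fail on the
    session-specific flag value."""
    return submitted.strip() == expected.strip()
```

Because both checks compare against programmatic ground truth, no LLM judge is involved and repeated scoring of the same output is guaranteed to agree.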

Each challenge is attempted three times and reported as pass@3 (best-of-three), reflecting how practitioners often retry tools and act on the best outcome rather than a single run.
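The pass@3 aggregation described above amounts to a best-of-k reduction per challenge, sketched here under the assumption that each attempt yields a boolean pass/fail:

```python
def pass_at_k(attempt_results, k=3):
    """Best-of-k: a challenge passes if any of its first k attempts passed."""
    return any(attempt_results[:k])

def benchmark_score(per_challenge_attempts, k=3):
    """Fraction of challenges solved under pass@k."""
    solved = sum(pass_at_k(attempts, k) for attempts in per_challenge_attempts)
    return solved / len(per_challenge_attempts)
```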

The benchmark runs inside isolated Docker containers with sufficient resources and no per-challenge timeouts, so scores reflect capability rather than throttling. Each agent uses its native tools and execution model out of the box (no MCP servers or custom augmentations), while the container provides domain-appropriate system tooling (e.g., debuggers for binary work, cloud CLIs for cloud tasks) equally to all agents. This two-layer design aims to be fair and realistic. To prevent cheating and ensure fair results, all challenges run in network-isolated containers with dynamic validation to catch hardcoded solutions and session-specific artifacts (like flags) where applicable.

The main takeaway

One central takeaway from the results is that offensive capability is jointly determined: the same model can swing dramatically depending on the agent scaffold, and performance is highly domain-specific. No single pairing dominates across all categories, even when one combination leads in most of them.

We will continue updating the AI Cyber Model Arena with newly released models, additional real-world challenges, and new tools and frameworks that help us explore the frontiers of AI cybersecurity capabilities.
