Introducing AI Cyber Model Arena: A Real-World Benchmark for AI Agents in Cybersecurity

Wiz Research’s AI Cyber Model Arena benchmarks offensive AI security on 257 real-world challenges (zero-days, CVEs, API/web, and cloud across AWS/Azure/GCP/K8s), demonstrating what AI models and agents can really do

We are excited to launch the AI Cyber Model Arena. This work introduces a benchmark suite of 257 real-world challenges spanning five offensive domains: zero-day discovery, CVE detection (finding known vulnerabilities in code), API security, web security, and cloud security.

AI agents are rapidly becoming part of everyday security workflows, driven by the significant leap in LLM cybersecurity capabilities.

At Wiz Research, we continuously evaluate the cybersecurity capabilities of AI models to support vulnerability research, threat hunting, and in-product research efforts. We decided to build an evaluation benchmark based on the real-world cybersecurity challenges we face, and to share the results with the community.

Our goal is broad coverage across the offensive lifecycle: from cold-start memory-bug discovery, to static analysis of known vulnerability patterns, to dynamic exploitation in web/API settings, to multi-step cloud misconfiguration attacks across AWS, Azure, GCP, and Kubernetes. All of it is grounded in real exposures and vulnerabilities encountered in the day-to-day work of Wiz Research.

The methodology

The evaluation setup explicitly separates agent effects from model effects. We run a multi-agent × multi-model matrix, executing each combination across all five categories.
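
To illustrate the matrix structure, here is a minimal sketch of an orchestrator loop; the agent names, model names, challenge lists, and the run_challenge stub are all hypothetical placeholders, not the actual Wiz harness:

```python
import itertools
import random

# Hypothetical agent scaffolds, models, and challenge lists; the real matrix
# and harness are Wiz-internal, so this only illustrates the structure.
AGENTS = ["agent_a", "agent_b"]
MODELS = ["model_x", "model_y"]
CATEGORIES = ["zero_day", "cve", "api", "web", "cloud"]
CHALLENGES = {cat: [f"{cat}_{i}" for i in range(5)] for cat in CATEGORIES}

def run_challenge(agent: str, model: str, challenge_id: str) -> bool:
    # Stand-in for launching the agent+model pair on one challenge;
    # returns pass/fail. Here it is randomized purely for demonstration.
    return random.random() < 0.5

def run_matrix() -> dict:
    """Execute every agent x model combination across all five categories."""
    results = {}
    for agent, model, cat in itertools.product(AGENTS, MODELS, CATEGORIES):
        passed = [run_challenge(agent, model, c) for c in CHALLENGES[cat]]
        results[(agent, model, cat)] = sum(passed) / len(passed)
    return results

if __name__ == "__main__":
    for (agent, model, cat), rate in sorted(run_matrix().items()):
        print(f"{agent} + {model} on {cat}: {rate:.0%}")
```

Because every model runs under every scaffold, a score difference between two cells in the same row can be attributed to the agent rather than the model, and vice versa.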

Scoring is deterministic and programmatic, using category-specific ground truth (see the sketch after this list):

  • multi-dimensional rubrics for zero-day and CVE detection

  • endpoint-and-severity matching for API security

  • flag capture for web and cloud challenges
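
To make "deterministic and programmatic" concrete, here is a minimal sketch of per-category scorers; the rubric fields, the (endpoint, severity) tuple shape, and the flag comparison are our illustrative assumptions, not the benchmark's actual ground-truth schema:

```python
import hmac

def score_rubric(report: dict, rubric: dict) -> float:
    # Multi-dimensional rubric (zero-day / CVE detection): fraction of
    # required dimensions (e.g., root cause, location, impact) satisfied.
    hits = sum(1 for dim, expected in rubric.items() if report.get(dim) == expected)
    return hits / len(rubric)

def score_api(findings: set, ground_truth: set) -> float:
    # API security: each element is an (endpoint, severity) tuple, so a
    # finding counts only if both the endpoint and the severity match.
    return len(findings & ground_truth) / len(ground_truth)

def score_flag(submitted: str, expected: str) -> float:
    # Web/cloud: exact flag capture; constant-time compare as a precaution.
    return 1.0 if hmac.compare_digest(submitted.strip(), expected) else 0.0
```

The common property is that none of these scorers involve an LLM judge: given the same agent output and the same ground truth, the score is always the same.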

Each challenge is attempted three times and reported as pass@3 (best-of-three), reflecting how practitioners often retry tools and act on the best outcome rather than a single run.
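
Concretely, best-of-three means a challenge counts as solved if any of its three independent attempts succeeds. A minimal sketch (the attempt records below are illustrative, not benchmark data):

```python
def pass_at_3(attempts: list) -> bool:
    # Best-of-three: solved if any of the (up to) three attempts passed.
    return any(attempts[:3])

# Illustrative attempt records for three challenges
runs = {
    "chal_1": [False, True, False],
    "chal_2": [False, False, False],
    "chal_3": [True, True, True],
}
solved = sum(pass_at_3(a) for a in runs.values())
print(f"pass@3: {solved}/{len(runs)} = {solved / len(runs):.0%}")  # pass@3: 2/3 = 67%
```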

The benchmark runs inside isolated Docker containers with sufficient resources and no per-challenge timeouts, so scores reflect capability rather than throttling. Each agent uses its native tools and execution model out of the box (no MCP servers or custom augmentations), while the container provides domain-appropriate system tooling (e.g., debuggers for binary work, cloud CLIs for cloud tasks) equally to all agents. This two-layer design aims to be fair and realistic. To prevent cheating and ensure fair results, all challenges run in network-isolated containers with dynamic validation to catch hardcoded solutions and session-specific artifacts (like flags) where applicable.
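
As a rough sketch of this kind of isolation, the snippet below launches a single attempt in a network-isolated container via the docker Python SDK; the image, command, and resource limits are placeholder assumptions, and the real harness (including its dynamic validation of flags and artifacts) is not published:

```python
import docker  # docker-py: pip install docker

def run_isolated(image: str, command: str) -> str:
    """Run one challenge attempt in an isolated container and return its logs."""
    client = docker.from_env()
    output = client.containers.run(
        image,
        command,
        network_mode="none",  # no network egress from the attempt container
        mem_limit="8g",       # generous resources so scores reflect capability
        remove=True,          # clean up the container after the run
    )
    return output.decode()

# Placeholder image and command; a real challenge image would ship its own
# domain-appropriate tooling (debuggers, cloud CLIs, etc.).
print(run_isolated("python:3.12-slim", 'python -c "print(42)"'))
```

Note there are no timeout or kill parameters here, mirroring the design choice that runs are resource-sufficient and untimed so results measure capability, not throttling.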

The main takeaway

One central takeaway from the results is that offensive capability is jointly determined by the model and the agent scaffold: the same model can swing dramatically depending on the scaffold, and performance is highly domain-specific. No single pairing dominates across all categories, even when one combination leads in most of them.

We will continue updating the AI Cyber Model Arena with newly released models, additional real-world challenges, and new tools and frameworks that help us explore the frontiers of AI cybersecurity capabilities.
