BENCHMARK SUITE

Benchmarks

Each benchmark is a set of probes run against containerized targets and graded by hidden oracles — no leaked answers, no gameable heuristics. We report localization F1 down to the function level and verify exploits end to end.

ZP-WEB-01 Live

Web vulnerability discovery & repair

Web security

A reproducible suite of real web applications seeded with known and held-out vulnerabilities, including injection, broken access control, auth bypass, and SSRF. Agents are scored on detection, on localization F1 at both the file and function level, and on end-to-end exploitation verified against a hidden oracle.

Targets: 120+
Vuln classes: 14
Localization: fn-level F1
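
To make the localization metric concrete, here is a minimal sketch of how file- and function-level F1 could be computed from an agent's reported locations. The function name, the sample paths, and the convention that two empty sets score 1.0 are illustrative assumptions, not the suite's actual grader.

```python
def localization_f1(predicted, truth):
    """F1 between a set of predicted locations and the ground-truth set.

    Assumption: if both sets are empty the agent is treated as fully
    correct (1.0). The real oracle may use a different convention.
    """
    predicted, truth = set(predicted), set(truth)
    if not predicted and not truth:
        return 1.0
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


# Hypothetical target: two vulnerable functions, identified as (file, function).
truth = {("app/views.py", "login"), ("app/db.py", "run_query")}
predicted = {
    ("app/views.py", "login"),
    ("app/views.py", "logout"),
    ("app/api.py", "fetch_url"),
}

# Function-level: the (file, function) pair must match exactly.
fn_f1 = localization_f1(predicted, truth)

# File-level: only the file has to match.
file_f1 = localization_f1({f for f, _ in predicted}, {f for f, _ in truth})

print(f"function-level F1 = {fn_f1:.2f}, file-level F1 = {file_f1:.2f}")
```

In this made-up case the agent finds the right file more often than the right function, so the file-level score is higher than the function-level one.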

ZP-NET-01 Building

Network reconnaissance & service exploitation

Network security

Containerized network ranges with realistic service topologies. Agents must enumerate hosts, identify exploitable services, and chain access — graded on the trajectory, not just the final flag.

Scenarios: 30
Grading: trajectory
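
One way to read "graded on the trajectory" is as credit for reaching intermediate milestones in order. The sketch below illustrates that idea only. The milestone names, the strict ordering rule, and the equal weighting are all assumptions rather than the benchmark's published grading scheme.

```python
def trajectory_score(events, milestones):
    """Fraction of milestones reached, in the required order.

    `events` is the agent's action log. Unrelated events are ignored,
    and a milestone only counts once every earlier one has been hit.
    """
    if not milestones:
        return 0.0
    reached = 0
    for event in events:
        if reached < len(milestones) and event == milestones[reached]:
            reached += 1
    return reached / len(milestones)


# Hypothetical milestone names for one scenario.
MILESTONES = ["enumerate_hosts", "identify_service", "chain_access", "capture_flag"]

# Did the groundwork but never reached the flag.
methodical = ["enumerate_hosts", "port_scan", "identify_service", "chain_access"]
# Produced the flag with no supporting trajectory.
flag_only = ["capture_flag"]

print(trajectory_score(methodical, MILESTONES))
print(trajectory_score(flag_only, MILESTONES))
```

Under these assumptions, a run that does the groundwork but misses the flag still earns most of the credit, while a bare flag with no supporting steps earns none.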

ZP-HOST-01 Building

Host security & privilege escalation

Host security

Linux and Windows host images with misconfigurations, vulnerable SUID binaries, and credential-leak paths. Measures an agent's ability to move from initial foothold to root with verified, reproducible escalation.

Images: 40
Escalation paths: verified

ZP-CLOUD-01 Planned

Cloud configuration security

Cloud security

Infrastructure-as-Code misconfiguration and IAM privilege-escalation scenarios drawn from real-world patterns. Agents are graded against ground-truth policy analysis, with partial credit for correct reasoning chains.

Providers: AWS · GCP
Checks: IaC + IAM
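
Partial credit for reasoning chains could work roughly as follows: some weight for the final verdict, and the rest for how much of the ground-truth escalation chain the agent identified. This is a sketch under stated assumptions. The `Answer` type, the 50/50 weighting, and the verdict labels are hypothetical. The IAM actions in the example are real AWS action names, used here only as sample data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Answer:
    finding: str   # final verdict, e.g. "privilege_escalation"
    chain: tuple   # permissions or steps cited in the reasoning


def partial_credit(answer, truth, finding_weight=0.5):
    """Blend verdict correctness with coverage of the ground-truth chain.

    Assumption: the weights are illustrative, and a step counts if it
    appears anywhere in the agent's chain, regardless of order.
    """
    finding_score = 1.0 if answer.finding == truth.finding else 0.0
    if truth.chain:
        matched = sum(1 for step in truth.chain if step in answer.chain)
        chain_score = matched / len(truth.chain)
    else:
        chain_score = 0.0
    return finding_weight * finding_score + (1 - finding_weight) * chain_score


truth = Answer(
    finding="privilege_escalation",
    chain=("iam:PassRole", "lambda:CreateFunction", "lambda:InvokeFunction"),
)
# Wrong verdict, but two of the three chain steps were identified.
agent = Answer(finding="no_issue", chain=("iam:PassRole", "lambda:CreateFunction"))

score = partial_credit(agent, truth)
print(f"partial credit = {score:.3f}")
```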