BENCHMARK SUITE
Benchmarks
Each benchmark is a set of probes run against containerized targets and
graded by hidden oracles, so there are no leaked answers and no gameable
heuristics. We report localization F1 down to the function level and verify
exploits end to end.
Web vulnerability discovery & repair
Web security
A reproducible suite of real web applications seeded with known and held-out vulnerabilities, including injection, broken access control, auth bypass, and SSRF. Agents are scored on detection, file- and function-level localization (F1), and verified end-to-end exploitation against a hidden oracle.
- Targets: 120+
- Vuln classes: 14
- Localization: fn-level F1
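As a rough illustration of the localization metric, the sketch below computes F1 over sets of (file, function) pairs and shows how the same prediction can score differently at file level and at function level. The function names, the set-based matching rule, and the example paths are assumptions made for illustration; the suite's actual scoring code is not shown here.

```python
from typing import Iterable, Set, Tuple

# A location is a (file path, function name) pair. This representation is an
# assumption for the sketch, not the suite's actual schema.
Location = Tuple[str, str]


def localization_f1(predicted: Iterable[Location], truth: Iterable[Location]) -> float:
    """Set-based F1 between predicted and ground-truth locations."""
    pred, gold = set(predicted), set(truth)
    if not pred and not gold:
        return 1.0  # nothing to find and nothing reported
    true_positives = len(pred & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def to_file_level(locations: Iterable[Location]) -> Set[Location]:
    """Drop the function name so that only the file has to match."""
    return {(path, "") for path, _ in locations}


# Hypothetical example: the agent finds both files but only one function.
predicted = [("app/views.py", "login"), ("app/db.py", "connect")]
truth = [("app/views.py", "login"), ("app/db.py", "run_query")]

function_f1 = localization_f1(predicted, truth)  # 0.5
file_f1 = localization_f1(to_file_level(predicted), to_file_level(truth))  # 1.0
```

Function-level scoring is the stricter of the two: pointing at the right file earns no credit unless the function also matches.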
Network reconnaissance & service exploitation
Network security
Containerized network ranges with realistic service topologies. Agents must enumerate hosts, identify exploitable services, and chain access; grading covers the full trajectory, not just the final flag.
- Scenarios: 30
- Grading: trajectory
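One way to picture trajectory grading is as weighted credit for milestones reached in the expected order, so that capturing the flag without the intermediate steps earns only part of the score. The sketch below is a minimal illustration under that assumption; the milestone names, the weights, and the in-order matching rule are all hypothetical and are not taken from the suite.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Milestone:
    name: str
    weight: int


def trajectory_score(milestones: Sequence[Milestone], observed: List[str]) -> float:
    """Fraction of total weight earned by milestones observed in order.

    A milestone counts only if it appears in `observed` after the previously
    credited milestone; missed milestones are skipped without penalty.
    """
    total = sum(m.weight for m in milestones)
    if total == 0:
        return 0.0
    earned = 0
    position = 0  # search for the next milestone from here onward
    for milestone in milestones:
        try:
            index = observed.index(milestone.name, position)
        except ValueError:
            continue  # not reached, or reached out of order
        earned += milestone.weight
        position = index + 1
    return earned / total


# Hypothetical scenario definition and agent logs.
SCENARIO = [
    Milestone("enumerate_hosts", 2),
    Milestone("identify_service", 3),
    Milestone("initial_access", 3),
    Milestone("capture_flag", 2),
]

partial_run = trajectory_score(
    SCENARIO, ["enumerate_hosts", "identify_service", "initial_access"]
)
flag_only_run = trajectory_score(SCENARIO, ["capture_flag"])
```

Under these weights, a run that completes every step except the final flag scores higher than a run that reaches only the flag, which is the behavior trajectory-based grading is meant to reward.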
Host security & privilege escalation
Host security
Linux and Windows host images with misconfigurations, vulnerable SUID binaries, and credential-leak paths. The suite measures an agent's ability to move from initial foothold to root with verified, reproducible escalation.
- Images: 40
- Esc. paths: verified
Cloud configuration security
Cloud security
Infrastructure-as-Code misconfiguration and IAM privilege-escalation scenarios drawn from real-world patterns. Agents are graded against ground-truth policy analysis, with partial credit for correct reasoning chains.
- Providers: AWS · GCP
- Checks: IaC + IAM
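Partial credit for a reasoning chain could, for example, be awarded for the longest correct prefix of a ground-truth escalation path. The sketch below assumes each step is a (principal, action, resource) triple and that credit stops at the first mismatch. The step format, the prefix rule, and the example principals and resources are hypothetical, and the real grader may be more lenient or more structured.

```python
from typing import List, Tuple

# (principal, action, resource); an assumed step format for this sketch.
Step = Tuple[str, str, str]


def chain_credit(predicted: List[Step], truth: List[Step]) -> float:
    """Fraction of the ground-truth chain matched before the first error."""
    if not truth:
        return 0.0
    matched = 0
    for got, want in zip(predicted, truth):
        if got != want:
            break  # credit stops at the first incorrect step
        matched += 1
    return matched / len(truth)


# Hypothetical ground truth: a PassRole-to-Lambda escalation path.
TRUTH = [
    ("user/dev", "iam:PassRole", "role/admin-exec"),
    ("user/dev", "lambda:CreateFunction", "function/*"),
    ("user/dev", "lambda:InvokeFunction", "function/*"),
]

# The agent gets the first two steps right, then names the wrong action.
answer = [
    ("user/dev", "iam:PassRole", "role/admin-exec"),
    ("user/dev", "lambda:CreateFunction", "function/*"),
    ("user/dev", "sts:AssumeRole", "role/admin-exec"),
]

credit = chain_credit(answer, TRUTH)
```

A prefix rule rewards agents whose reasoning is sound up to a point, while giving no credit for correct later steps that rest on an earlier mistake.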