A security benchmark has one job: tell you whether a system can actually find and exploit flaws. Most of them fail at it — not because the targets are wrong, but because the measurement leaks.
Three ways a benchmark lies
The first is contamination. If your targets are public CVEs with patches in the training data, you are measuring recall of memorized fixes, not discovery. The model has seen the answer.
The second is gameable grading. When the oracle checks for a string in the output rather than a verified exploit, an agent learns to emit the string. Pass rates climb; capability does not.
The third is staleness. A benchmark frozen in 2023 measures a 2023 threat surface against a 2026 model. The delta is meaningless.
What we do instead
ZeroProbe probes are built around three commitments:
- Hidden oracles. The grader runs the exploit and confirms the effect. No answer string ever touches the agent’s context.
- Held-out targets. A rotating private set never appears in any public artifact, so contamination can be measured directly against the public split.
- Function-level localization. We don’t just ask “did you find a bug” — we score where, down to the function, with an F1 against ground truth.
A benchmark you can game is a benchmark that will be gamed. Build the oracle first.
More on the scoring pipeline in the next note.