We seed our web targets with known vulnerabilities so we can grade how well agents localize them. Every so often an agent finds something we didn’t put there. This is one of those.
The setup
The target was a small image-proxy service — fetch a remote URL, resize, return the result. We had seeded an access-control bug elsewhere in the app. The agent was supposed to find that.
What it found instead
The agent noticed that the proxy followed redirects without re-validating the destination host. By pointing it at a URL that 302-redirected to an internal metadata endpoint, it pulled back data that should never have been reachable from the outside. This is classic server-side request forgery (SSRF), reached through a redirect bounce: the naive host allowlist checked only the initial URL, so the redirect target bypassed it.
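To make the redirect bounce concrete, here is a minimal sketch of the kind of check that would have stopped it: follow redirects by hand and re-validate the host on every hop. This is not the target's actual code. The allowlist contents, the host names, and the injected `fetch` callable are all hypothetical, chosen only to illustrate the idea.

```python
from urllib.parse import urljoin, urlsplit

# Hypothetical allowlist; the real service's configuration is not shown in this post.
ALLOWED_HOSTS = {"images.example.com"}
REDIRECT_CODES = {301, 302, 303, 307, 308}


class BlockedDestination(Exception):
    """Raised when any hop in the redirect chain leaves the allowlist."""


def host_allowed(url, allowed_hosts=ALLOWED_HOSTS):
    """Return True if the URL uses http(s) and its host is on the allowlist."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and parts.hostname in allowed_hosts


def resolve_final_url(url, fetch, allowed_hosts=ALLOWED_HOSTS, max_hops=5):
    """Follow redirects manually, re-checking the host before every request.

    `fetch(url)` is an assumed, injected callable returning
    (status_code, location_header_or_None). In a real proxy it would issue
    the request with automatic redirect-following disabled.
    """
    for _ in range(max_hops + 1):
        if not host_allowed(url, allowed_hosts):
            raise BlockedDestination(url)
        status, location = fetch(url)
        if status not in REDIRECT_CODES or not location:
            return url  # final destination, every hop was validated
        url = urljoin(url, location)  # resolve relative Location headers
    raise BlockedDestination("too many redirects")
```

A check that calls `host_allowed` only once, on the URL the client supplied, passes the first hop and never looks at where the 302 points. Note that a hostname allowlist alone would still not cover cases such as a permitted name resolving to an internal address, so a production fix would likely also validate the resolved IP.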
The interesting part wasn’t the bug — it was the trajectory. The agent:
- Mapped the proxy’s URL handling from observed behavior, not source.
- Hypothesized redirect-following from a single timing difference.
- Built and verified the chain end to end before reporting.
That sequence is exactly what we want our benchmarks to reward — and exactly what a string-matching oracle would have missed.
Why it matters for measurement
A finding outside the ground-truth set is a gift and a problem. A gift because it’s real capability. A problem because a naive scorer marks it wrong. Our oracle verifies effects, so the exploit was confirmed even though it wasn’t on the answer key. We then folded it back into the held-out set.
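The difference between the two scoring styles can be sketched in a few lines. This is an illustration under stated assumptions, not our harness: it assumes effects are verified by planting a random canary value behind the internal endpoint and checking whether it shows up in what the agent retrieved. The `Finding` type and both scoring functions are hypothetical names.

```python
import secrets
from dataclasses import dataclass


@dataclass
class Finding:
    label: str     # the agent's free-text name for the vulnerability
    evidence: str  # raw text the agent retrieved by running its exploit


def make_canary():
    """Generate an unguessable marker to plant behind the protected endpoint."""
    return "canary-" + secrets.token_hex(16)


def string_match_score(finding, answer_key):
    """Naive scorer: credit only if the label appears on the answer key."""
    return finding.label.lower() in answer_key


def effect_score(finding, canary):
    """Effect-based scorer: credit if the exploit actually exfiltrated the canary."""
    return canary in finding.evidence
```

Under these assumptions, a finding labeled "SSRF via redirect" scores zero against an answer key that lists only the seeded access-control bug, yet passes `effect_score` as long as its evidence contains the planted canary.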
This is the loop we’re building toward: agents that probe, a harness that verifies, and a benchmark that grows from what gets found.