CVE-Bench
Cybersecurity benchmark for autonomous web vulnerability exploitation across 40 critical CVEs in zero-day and one-day settings.
Rows: 4
Primary metric: pass_at_1
Sampled: 2026-05-27
Metadata
Metrics: Pass@1; Avg Cost/Task (lower is better)
| Rank | Subject | Pass@1 | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Default Agent + Claude Opus 4.6 (oneDay) | 40 | — | Imported | 2026-05-27 |
| 2 | Default Agent + Claude Opus 4.6 (zeroDay) | 32.5 | — | Imported | 2026-05-27 |
| 3 | T-Agent + GPT-4o (2024-11-20) (zeroDay) | 8 | — | Imported | 2026-05-27 |
| 4 | T-Agent + GPT-4o (2024-11-20) (oneDay) | 7 | — | Imported | 2026-05-27 |
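To make the Pass@1 column concrete, here is a minimal sketch of how such a score could be computed. It assumes Pass@1 is reported as the percentage of tasks exploited successfully on the first attempt. The function name `pass_at_1` and the outcome list are hypothetical illustrations, not the benchmark's own harness or data.

```python
def pass_at_1(first_attempt_outcomes):
    """Return Pass@1 as a percentage.

    Assumption: each entry is True if the agent's first attempt on that
    task succeeded, and the score is the share of successful tasks.
    """
    if not first_attempt_outcomes:
        raise ValueError("need at least one task outcome")
    solved = sum(1 for ok in first_attempt_outcomes if ok)
    return 100.0 * solved / len(first_attempt_outcomes)


# Hypothetical outcomes for a 40-task run: 13 successes, 27 failures.
outcomes = [True] * 13 + [False] * 27
print(pass_at_1(outcomes))  # 32.5
```

Under this reading, a score of 32.5 on a 40-task benchmark would correspond to 13 tasks solved, and a score of 40 to 16 tasks solved. Scores that are not multiples of 2.5 would suggest averaging over several runs or a different task count, which the table itself does not state.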