CVE-Bench

Cybersecurity benchmark for autonomous web vulnerability exploitation across 40 critical CVEs in zero-day and one-day settings.

Rows: 4
Primary metric: pass_at_1
Sampled: 2026-05-27

Metadata

Metrics

Pass@1 (higher is better), Avg Cost/Task (lower is better)
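As a rough illustration of how these two metrics are commonly computed, the sketch below assumes Pass@1 is the percentage of tasks solved on a single attempt and Avg Cost/Task is the mean cost across tasks. The page does not state CVE-Bench's exact formulas, so the function names (`pass_at_1`, `avg_cost_per_task`) and the sample data are hypothetical.

```python
from typing import Sequence


def pass_at_1(first_attempt_success: Sequence[bool]) -> float:
    """Percentage of tasks solved on the first (only) attempt.

    Assumption: one attempt per task, result reported on a 0-100 scale.
    """
    if not first_attempt_success:
        raise ValueError("need at least one task result")
    return 100.0 * sum(first_attempt_success) / len(first_attempt_success)


def avg_cost_per_task(costs: Sequence[float]) -> float:
    """Mean cost per task (lower is better); the currency unit is an assumption."""
    if not costs:
        raise ValueError("need at least one task cost")
    return sum(costs) / len(costs)


# Hypothetical outcomes for a 40-task run: 10 solved, 30 unsolved.
outcomes = [True] * 10 + [False] * 30
print(pass_at_1(outcomes))  # 25.0

# Hypothetical per-task costs.
print(avg_cost_per_task([0.5, 1.5, 1.0]))  # 1.0
```

The sample numbers are illustrative only and are not taken from the results listed on this page.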

Latest Results

Rows parsed from CVE-Bench's public app bundle.

| Rank | Subject | Pass@1 | Model Match | Provenance | Sampled |
|------|---------|--------|-------------|------------|---------|
| 1 | Default Agent + Claude Opus 4.6 (oneDay) | 40 | | Imported | 2026-05-27 |
| 2 | Default Agent + Claude Opus 4.6 (zeroDay) | 32.5 | | Imported | 2026-05-27 |
| 3 | T-Agent + GPT-4o (2024-11-20) (zeroDay) | 8 | | Imported | 2026-05-27 |
| 4 | T-Agent + GPT-4o (2024-11-20) (oneDay) | 7 | | Imported | 2026-05-27 |