SEC-bench
Security engineering benchmark evaluating agents on vulnerability discovery, proof-of-concept generation, and vulnerability patching targets.
Rows: 13
Primary metric: resolved_pct
Sampled: 2026-05-28
Metrics
Resolved, Success Rate, Solved, Instances, Checked, Verified PoCs, Unsure PoCs (lower is better), Illegal PoCs (lower is better), No PoC (lower is better), Average Cost (lower is better), Total Cost (lower is better)
| Rank | Subject | Resolved | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | OpenAI Codex (v0.115.0) + GPT-5.4 (xhigh) (V8) | 32.0% resolved | — | Imported | 2026-05-28 |
| 2 | Claude Code (v2.1.81) + Opus 4.6 (high) (V8) | 21.4% resolved | — | Imported | 2026-05-28 |
| 3 | OpenCode (v1.14.19) + Kimi K2.6 (high) (V8) | 11.7% resolved | — | Imported | 2026-05-28 |
| 2 | OpenAI Codex (v0.115.0) + GPT-5.4 (xhigh) (SpiderMonkey) | 23.8% resolved | — | Imported | 2026-05-28 |
| 1 | Claude Code (v2.1.81) + Opus 4.6 (high) (SpiderMonkey) | 38.8% resolved | — | Imported | 2026-05-28 |
| 6 | OpenHands + Claude-3.7-Sonnet (PoC Generation) | 18.0% resolved | — | Imported | 2026-05-28 |
| 7 | SWE-agent + Claude-3.7-Sonnet (PoC Generation) | 12.5% resolved | — | Imported | 2026-05-28 |
| 8 | Aider + Claude-3.7-Sonnet (PoC Generation) | 3.0% resolved | — | Imported | 2026-05-28 |
| 9 | AgenticRepair + GPT-5.2 (Vulnerability Patching) | 75.0% resolved | — | Imported | 2026-05-28 |
| 10 | AgenticRepair + GPT-5-mini (Vulnerability Patching) | 50.0% resolved | — | Imported | 2026-05-28 |
| 11 | OpenHands + Claude-3.7-Sonnet (Vulnerability Patching) | 34.0% resolved | — | Imported | 2026-05-28 |
| 12 | SWE-agent + Claude-3.7-Sonnet (Vulnerability Patching) | 31.5% resolved | — | Imported | 2026-05-28 |
| 13 | Aider + Claude-3.7-Sonnet (Vulnerability Patching) | 23.5% resolved | — | Imported | 2026-05-28 |
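The table mixes several tracks (V8, SpiderMonkey, PoC Generation, Vulnerability Patching), so resolved percentages are probably only comparable within a track. A minimal sketch of grouping rows by track and sorting within each group follows. It assumes the track is the final parenthesized suffix of the Subject column; the helper names `split_track` and `rank_by_track` are hypothetical and not part of any SEC-bench tooling, and only a subset of rows is copied in.

```python
import re
from collections import defaultdict

# A few (subject, resolved %) pairs copied from the table above.
ROWS = [
    ("OpenAI Codex (v0.115.0) + GPT-5.4 (xhigh) (V8)", 32.0),
    ("Claude Code (v2.1.81) + Opus 4.6 (high) (V8)", 21.4),
    ("OpenAI Codex (v0.115.0) + GPT-5.4 (xhigh) (SpiderMonkey)", 23.8),
    ("Claude Code (v2.1.81) + Opus 4.6 (high) (SpiderMonkey)", 38.8),
    ("AgenticRepair + GPT-5.2 (Vulnerability Patching)", 75.0),
    ("OpenHands + Claude-3.7-Sonnet (Vulnerability Patching)", 34.0),
]


def split_track(subject):
    """Split 'agent + model (Track)' into (agent_and_model, track).

    Assumption: the track is the last parenthesized group in the subject.
    """
    match = re.fullmatch(r"(.*)\s\(([^()]*)\)", subject)
    if match is None:
        return subject, "unknown"
    return match.group(1), match.group(2)


def rank_by_track(rows):
    """Group rows by track, then sort each group by resolved % (descending)."""
    grouped = defaultdict(list)
    for subject, resolved_pct in rows:
        name, track = split_track(subject)
        grouped[track].append((name, resolved_pct))
    return {
        track: sorted(entries, key=lambda entry: entry[1], reverse=True)
        for track, entries in grouped.items()
    }


if __name__ == "__main__":
    for track, entries in rank_by_track(ROWS).items():
        print(track)
        for position, (name, resolved_pct) in enumerate(entries, start=1):
            print(f"  {position}. {name}: {resolved_pct:.1f}% resolved")
```

Grouping before sorting keeps a high patching score from being read as outperforming a lower PoC-generation or browser-engine score, which measure different tasks.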