SEC-bench

A security engineering benchmark that evaluates agents on vulnerability discovery, proof-of-concept (PoC) generation, and vulnerability patching targets.

Rows: 13
Primary metric: resolved_pct
Sampled: 2026-05-28

Metadata

Metrics

Resolved, Success Rate, Solved, Instances, Checked, Verified PoCs, Unsure PoCs (lower is better), Illegal PoCs (lower is better), No PoC (lower is better), Average Cost (lower is better), Total Cost (lower is better)

Latest Results

Rows are imported from the official SEC-bench embedded leaderboard-data JSON and preserve target-specific agent scores.

Rank | Subject | Resolved | Model Match | Provenance | Sampled
---- | ------- | -------- | ----------- | ---------- | -------
1 | OpenAI Codex (v0.115.0) + GPT-5.4 (xhigh) (V8) | 32.0% | resolved | Imported | 2026-05-28
2 | Claude Code (v2.1.81) + Opus 4.6 (high) (V8) | 21.4% | resolved | Imported | 2026-05-28
3 | OpenCode (v1.14.19) + Kimi K2.6 (high) (V8) | 11.7% | resolved | Imported | 2026-05-28
2 | OpenAI Codex (v0.115.0) + GPT-5.4 (xhigh) (SpiderMonkey) | 23.8% | resolved | Imported | 2026-05-28
1 | Claude Code (v2.1.81) + Opus 4.6 (high) (SpiderMonkey) | 38.8% | resolved | Imported | 2026-05-28
6 | OpenHands + Claude-3.7-Sonnet (PoC Generation) | 18.0% | resolved | Imported | 2026-05-28
7 | SWE-agent + Claude-3.7-Sonnet (PoC Generation) | 12.5% | resolved | Imported | 2026-05-28
8 | Aider + Claude-3.7-Sonnet (PoC Generation) | 3.0% | resolved | Imported | 2026-05-28
9 | AgenticRepair + GPT-5.2 (Vulnerability Patching) | 75.0% | resolved | Imported | 2026-05-28
10 | AgenticRepair + GPT-5-mini (Vulnerability Patching) | 50.0% | resolved | Imported | 2026-05-28
11 | OpenHands + Claude-3.7-Sonnet (Vulnerability Patching) | 34.0% | resolved | Imported | 2026-05-28
12 | SWE-agent + Claude-3.7-Sonnet (Vulnerability Patching) | 31.5% | resolved | Imported | 2026-05-28
13 | Aider + Claude-3.7-Sonnet (Vulnerability Patching) | 23.5% | resolved | Imported | 2026-05-28
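
Because the rows preserve target-specific scores, comparisons are most meaningful within a single target. The following is a minimal sketch of grouping the rows above by target and sorting each group by resolved percentage. It assumes the trailing parenthesized suffix of each subject is the target name; the `ROWS` list and the helpers `split_target` and `rank_by_target` are hypothetical names introduced for illustration and are not part of SEC-bench or its leaderboard JSON.

```python
import re
from collections import defaultdict

# Subjects and resolved percentages copied from the rows above.
ROWS = [
    ("OpenAI Codex (v0.115.0) + GPT-5.4 (xhigh) (V8)", 32.0),
    ("Claude Code (v2.1.81) + Opus 4.6 (high) (V8)", 21.4),
    ("OpenCode (v1.14.19) + Kimi K2.6 (high) (V8)", 11.7),
    ("OpenAI Codex (v0.115.0) + GPT-5.4 (xhigh) (SpiderMonkey)", 23.8),
    ("Claude Code (v2.1.81) + Opus 4.6 (high) (SpiderMonkey)", 38.8),
    ("OpenHands + Claude-3.7-Sonnet (PoC Generation)", 18.0),
    ("SWE-agent + Claude-3.7-Sonnet (PoC Generation)", 12.5),
    ("Aider + Claude-3.7-Sonnet (PoC Generation)", 3.0),
    ("AgenticRepair + GPT-5.2 (Vulnerability Patching)", 75.0),
    ("AgenticRepair + GPT-5-mini (Vulnerability Patching)", 50.0),
    ("OpenHands + Claude-3.7-Sonnet (Vulnerability Patching)", 34.0),
    ("SWE-agent + Claude-3.7-Sonnet (Vulnerability Patching)", 31.5),
    ("Aider + Claude-3.7-Sonnet (Vulnerability Patching)", 23.5),
]


def split_target(subject):
    """Split 'Agent + Model (Target)' into (agent_and_model, target).

    Assumption: the last parenthesized group of the subject is the target.
    """
    match = re.fullmatch(r"(.*) \(([^()]*)\)", subject)
    if match is None:
        raise ValueError(f"no trailing target in subject: {subject!r}")
    return match.group(1), match.group(2)


def rank_by_target(rows):
    """Group rows by target and sort each group by resolved %, highest first."""
    grouped = defaultdict(list)
    for subject, resolved_pct in rows:
        name, target = split_target(subject)
        grouped[target].append((name, resolved_pct))
    return {
        target: sorted(entries, key=lambda entry: entry[1], reverse=True)
        for target, entries in grouped.items()
    }


if __name__ == "__main__":
    for target, entries in rank_by_target(ROWS).items():
        print(target)
        for position, (name, resolved_pct) in enumerate(entries, start=1):
            print(f"  {position}. {name}: {resolved_pct:.1f}%")
```

Grouping this way keeps, for example, the V8 and SpiderMonkey results from being compared directly against the vulnerability patching results, which measure a different task.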