ExploitBench v8-bench
Capability-graded cybersecurity agent benchmark measuring how far AI systems progress on 41 patched V8 exploitation tasks, from coverage and reproduction through exploit primitives and arbitrary code execution.
28 rows
Primary metric: mean_score
Sampled: 2026-05-28
Metrics
Mean score, Mean capability, T1 full control envs, T2 generic primitive envs, T3 target primitive envs, T4 reproduction envs, T5 coverage envs, No capability envs (lower is better), Environments, Episodes, Spend (lower is better)
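Showing the 2 latest source slices: 8 self-reported rows sampled 2026-05-28, followed by 20 imported rows sampled 2026-05-15. Ranks restart at 1 within each slice.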
| Rank | Subject | Mean score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude Mythos Preview (AutoNudge) | 9.9 points | Claude Mythos Preview anthropic-claude-mythos-preview | Self-reported | 2026-05-28 |
| 2 | Claude Mythos Preview (plain) | 9.55 points | Claude Mythos Preview anthropic-claude-mythos-preview | Self-reported | 2026-05-28 |
| 3 | Claude Opus 4.8 (AutoNudge) | 5.45 points | Claude Opus 4.8 anthropic-claude-opus-4.8 | Self-reported | 2026-05-28 |
| 4 | Claude Opus 4.8 (plain) | 5.02 points | Claude Opus 4.8 anthropic-claude-opus-4.8 | Self-reported | 2026-05-28 |
| 5 | Claude Opus 4.7 (AutoNudge) | 3.66 points | Claude Opus 4.7 anthropic-claude-opus-4.7 | Self-reported | 2026-05-28 |
| 6 | Claude Opus 4.7 (plain) | 3.46 points | Claude Opus 4.7 anthropic-claude-opus-4.7 | Self-reported | 2026-05-28 |
| 7 | Claude Sonnet 4.6 (plain) | 3.37 points | Claude Sonnet 4.6 anthropic-claude-sonnet-4.6 | Self-reported | 2026-05-28 |
| 8 | Claude Sonnet 4.6 (AutoNudge) | 3.17 points | Claude Sonnet 4.6 anthropic-claude-sonnet-4.6 | Self-reported | 2026-05-28 |

| Rank | Subject | Mean score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude Mythos Preview (nudged) | 9.9 points | Claude Mythos Preview anthropic-claude-mythos-preview | Imported | 2026-05-15 |
| 2 | Claude Mythos Preview | 9.55 points | Claude Mythos Preview anthropic-claude-mythos-preview | Imported | 2026-05-15 |
| 3 | GPT 5.5 (Codex) (nudged) | 5.51 points | GPT-5.5 openai-gpt-5.5 | Imported | 2026-05-15 |
| 4 | GPT 5.5 (nudged) | 4.44 points | GPT-5.5 openai-gpt-5.5 | Imported | 2026-05-15 |
| 5 | GPT 5.5 (Codex) | 4.3 points | GPT-5.5 openai-gpt-5.5 | Imported | 2026-05-15 |
| 6 | GPT 5.5 | 3.76 points | GPT-5.5 openai-gpt-5.5 | Imported | 2026-05-15 |
| 7 | Claude Opus 4.7 (nudged) | 3.66 points | Claude Opus 4.7 anthropic-claude-opus-4.7 | Imported | 2026-05-15 |
| 8 | Gemini 3.1 Pro Preview | 3.67 points | Gemini 3.1 Pro Preview google-gemini-3.1-pro-preview | Imported | 2026-05-15 |
| 9 | Claude Opus 4.7 | 3.46 points | Claude Opus 4.7 anthropic-claude-opus-4.7 | Imported | 2026-05-15 |
| 10 | Claude Sonnet 4.6 | 3.37 points | Claude Sonnet 4.6 anthropic-claude-sonnet-4.6 | Imported | 2026-05-15 |
| 11 | Claude Sonnet 4.6 (nudged) | 3.17 points | Claude Sonnet 4.6 anthropic-claude-sonnet-4.6 | Imported | 2026-05-15 |
| 12 | Kimi K2.6 (nudged) | 2.63 points | MoonshotAI: Kimi K2.6 moonshotai-kimi-k2.6 | Imported | 2026-05-15 |
| 13 | Glm 5.1 (nudged) | 2.62 points | GLM 5.1 z-ai-glm-5.1 | Imported | 2026-05-15 |
| 14 | Kimi K2.6 | 2.44 points | MoonshotAI: Kimi K2.6 moonshotai-kimi-k2.6 | Imported | 2026-05-15 |
| 15 | Glm 5.1 | 2.56 points | GLM 5.1 z-ai-glm-5.1 | Imported | 2026-05-15 |
| 16 | Gemini 3.1 Pro Preview (nudged) | 3.17 points | Gemini 3.1 Pro Preview google-gemini-3.1-pro-preview | Imported | 2026-05-15 |
| 17 | Claude Haiku 4.5 (nudged) | 2.12 points | Claude Haiku 4.5 anthropic-claude-haiku-4.5 | Imported | 2026-05-15 |
| 18 | Claude Haiku 4.5 | 2.15 points | Claude Haiku 4.5 anthropic-claude-haiku-4.5 | Imported | 2026-05-15 |
| 19 | MiniMax M2.7 | 2.07 points | MiniMax M2.7 minimax-minimax-m2.7 | Imported | 2026-05-15 |
| 20 | MiniMax M2.7 (nudged) | 2.06 points | MiniMax M2.7 minimax-minimax-m2.7 | Imported | 2026-05-15 |
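
One way to read the 2026-05-15 slice is to pair each model's "(nudged)" row with its plain row and compare mean scores. The sketch below does that arithmetic in Python. The scores are copied from the table above; treating the two rows as a matched pair is an assumption of this example, and `SCORES` and `nudge_deltas` are hypothetical names, not part of the benchmark's tooling.

```python
# Mean scores copied from the 2026-05-15 slice of the table above.
# Pairing "(nudged)" rows with plain rows is an assumption of this sketch.
SCORES = {
    "Claude Mythos Preview": {"plain": 9.55, "nudged": 9.9},
    "GPT 5.5 (Codex)": {"plain": 4.3, "nudged": 5.51},
    "GPT 5.5": {"plain": 3.76, "nudged": 4.44},
    "Claude Opus 4.7": {"plain": 3.46, "nudged": 3.66},
    "Gemini 3.1 Pro Preview": {"plain": 3.67, "nudged": 3.17},
    "Claude Sonnet 4.6": {"plain": 3.37, "nudged": 3.17},
    "Kimi K2.6": {"plain": 2.44, "nudged": 2.63},
    "Glm 5.1": {"plain": 2.56, "nudged": 2.62},
    "Claude Haiku 4.5": {"plain": 2.15, "nudged": 2.12},
    "MiniMax M2.7": {"plain": 2.07, "nudged": 2.06},
}


def nudge_deltas(scores):
    """Return {model: nudged - plain}, rounded to 2 decimals, largest gain first."""
    deltas = {
        model: round(pair["nudged"] - pair["plain"], 2)
        for model, pair in scores.items()
    }
    # Sort by delta, descending, so the biggest gains come first.
    return dict(sorted(deltas.items(), key=lambda item: item[1], reverse=True))


if __name__ == "__main__":
    for model, delta in nudge_deltas(SCORES).items():
        print(f"{model:<24} {delta:+.2f}")
```

On these figures the difference is positive for six of the ten paired models and negative for four, so the nudged rows do not uniformly outscore their plain counterparts.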