ExploitBench v8-bench

Capability-graded cybersecurity agent benchmark measuring how far AI systems progress on 41 patched V8 exploitation tasks, from coverage and reproduction through exploit primitives and arbitrary code execution.

28 rows
Primary metric: mean_score
Sampled: 2026-05-28

Metadata

Metrics

Mean score, Mean capability, T1 full control envs, T2 generic primitive envs, T3 target primitive envs, T4 reproduction envs, T5 coverage envs, No capability envs (lower is better), Environments, Episodes, Spend (lower is better)
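The "(lower is better)" annotations in the metric list matter whenever two rows are compared. A minimal sketch of how that direction could be encoded follows. The metric names are copied from the list above; the `LOWER_IS_BETTER` set and the `is_improvement` helper are hypothetical names, and treating every unannotated metric as higher-is-better is an assumption, not something the page states.

```python
# Metrics the page explicitly marks "(lower is better)".
LOWER_IS_BETTER = {"No capability envs", "Spend"}


def is_improvement(metric: str, old: float, new: float) -> bool:
    """Return True when `new` beats `old` for the given metric.

    Assumption: any metric not listed in LOWER_IS_BETTER is treated as
    higher-is-better. The page does not say this for every metric.
    """
    if metric in LOWER_IS_BETTER:
        return new < old
    return new > old
```

For example, `is_improvement("Mean score", 3.46, 3.66)` is true, while the same pair of values passed with `"Spend"` would not count as an improvement.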

Showing 2 latest source slices.

Latest Results

Provider-published system-card benchmark scores parsed from Anthropic's Claude Opus 4.8 capability evaluation tables. Rows are marked self-reported and should be interpreted as source claims unless independently reproduced.

Rank | Subject | Mean score | Model Match | Provenance | Sampled
1 | Claude Mythos Preview (AutoNudge) | 9.9 points | Claude Mythos Preview (anthropic-claude-mythos-preview) | Self-reported | 2026-05-28
2 | Claude Mythos Preview (plain) | 9.55 points | Claude Mythos Preview (anthropic-claude-mythos-preview) | Self-reported | 2026-05-28
3 | Claude Opus 4.8 (AutoNudge) | 5.45 points | Claude Opus 4.8 (anthropic-claude-opus-4.8) | Self-reported | 2026-05-28
4 | Claude Opus 4.8 (plain) | 5.02 points | Claude Opus 4.8 (anthropic-claude-opus-4.8) | Self-reported | 2026-05-28
5 | Claude Opus 4.7 (AutoNudge) | 3.66 points | Claude Opus 4.7 (anthropic-claude-opus-4.7) | Self-reported | 2026-05-28
6 | Claude Opus 4.7 (plain) | 3.46 points | Claude Opus 4.7 (anthropic-claude-opus-4.7) | Self-reported | 2026-05-28
7 | Claude Sonnet 4.6 (plain) | 3.37 points | Claude Sonnet 4.6 (anthropic-claude-sonnet-4.6) | Self-reported | 2026-05-28
8 | Claude Sonnet 4.6 (AutoNudge) | 3.17 points | Claude Sonnet 4.6 (anthropic-claude-sonnet-4.6) | Self-reported | 2026-05-28
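Each model in the self-reported slice appears twice, once with AutoNudge and once plain, so the effect of nudging is a simple subtraction. The sketch below works through that arithmetic using only the scores listed above; the `SELF_REPORTED` dict layout and the `nudge_delta` helper are hypothetical names introduced for illustration.

```python
# Mean scores copied from the self-reported slice (sampled 2026-05-28).
SELF_REPORTED = {
    "Claude Mythos Preview": {"AutoNudge": 9.9, "plain": 9.55},
    "Claude Opus 4.8": {"AutoNudge": 5.45, "plain": 5.02},
    "Claude Opus 4.7": {"AutoNudge": 3.66, "plain": 3.46},
    "Claude Sonnet 4.6": {"AutoNudge": 3.17, "plain": 3.37},
}


def nudge_delta(scores: dict) -> dict:
    """Return AutoNudge minus plain mean score per model, rounded to 2 places."""
    return {
        model: round(runs["AutoNudge"] - runs["plain"], 2)
        for model, runs in scores.items()
    }


if __name__ == "__main__":
    for model, delta in nudge_delta(SELF_REPORTED).items():
        print(f"{model}: {delta:+.2f} points")
```

On these rows the difference is positive for three models and negative for Claude Sonnet 4.6, where the plain run scores 0.20 points higher than the AutoNudge run.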

Rank | Subject | Mean score | Model Match | Provenance | Sampled
1 | Claude Mythos Preview (nudged) | 9.9 points | Claude Mythos Preview (anthropic-claude-mythos-preview) | Imported | 2026-05-15
2 | Claude Mythos Preview | 9.55 points | Claude Mythos Preview (anthropic-claude-mythos-preview) | Imported | 2026-05-15
3 | GPT 5.5 (Codex) (nudged) | 5.51 points | GPT-5.5 (openai-gpt-5.5) | Imported | 2026-05-15
4 | GPT 5.5 (nudged) | 4.44 points | GPT-5.5 (openai-gpt-5.5) | Imported | 2026-05-15
5 | GPT 5.5 (Codex) | 4.3 points | GPT-5.5 (openai-gpt-5.5) | Imported | 2026-05-15
6 | GPT 5.5 | 3.76 points | GPT-5.5 (openai-gpt-5.5) | Imported | 2026-05-15
7 | Claude Opus 4.7 (nudged) | 3.66 points | Claude Opus 4.7 (anthropic-claude-opus-4.7) | Imported | 2026-05-15
8 | Gemini 3.1 Pro Preview | 3.67 points | Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview) | Imported | 2026-05-15
9 | Claude Opus 4.7 | 3.46 points | Claude Opus 4.7 (anthropic-claude-opus-4.7) | Imported | 2026-05-15
10 | Claude Sonnet 4.6 | 3.37 points | Claude Sonnet 4.6 (anthropic-claude-sonnet-4.6) | Imported | 2026-05-15
11 | Claude Sonnet 4.6 (nudged) | 3.17 points | Claude Sonnet 4.6 (anthropic-claude-sonnet-4.6) | Imported | 2026-05-15
12 | Kimi K2.6 (nudged) | 2.63 points | KIMI MoonshotAI: Kimi K2.6 (moonshotai-kimi-k2.6) | Imported | 2026-05-15
13 | Glm 5.1 (nudged) | 2.62 points | GLM GLM 5.1 (z-ai-glm-5.1) | Imported | 2026-05-15
14 | Kimi K2.6 | 2.44 points | KIMI MoonshotAI: Kimi K2.6 (moonshotai-kimi-k2.6) | Imported | 2026-05-15
15 | Glm 5.1 | 2.56 points | GLM GLM 5.1 (z-ai-glm-5.1) | Imported | 2026-05-15
16 | Gemini 3.1 Pro Preview (nudged) | 3.17 points | Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview) | Imported | 2026-05-15
17 | Claude Haiku 4.5 (nudged) | 2.12 points | Claude Haiku 4.5 (anthropic-claude-haiku-4.5) | Imported | 2026-05-15
18 | Claude Haiku 4.5 | 2.15 points | Claude Haiku 4.5 (anthropic-claude-haiku-4.5) | Imported | 2026-05-15
19 | MiniMax M2.7 | 2.07 points | MiniMax M2.7 (minimax-minimax-m2.7) | Imported | 2026-05-15
20 | MiniMax M2.7 (nudged) | 2.06 points | MiniMax M2.7 (minimax-minimax-m2.7) | Imported | 2026-05-15
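Because the self-reported rows are meant to be read as source claims, one quick sanity check is whether they agree with the imported slice wherever the same model and configuration appear in both. The sketch below does that for the Anthropic rows only. It assumes that "AutoNudge" in one slice and "nudged" in the other denote the same configuration, which the page does not confirm; the dict names and the `compare_slices` helper are hypothetical.

```python
from typing import Dict, Tuple

Key = Tuple[str, bool]  # (model id, nudged?)

# Self-reported slice, sampled 2026-05-28.
SELF_REPORTED: Dict[Key, float] = {
    ("anthropic-claude-mythos-preview", True): 9.9,
    ("anthropic-claude-mythos-preview", False): 9.55,
    ("anthropic-claude-opus-4.8", True): 5.45,
    ("anthropic-claude-opus-4.8", False): 5.02,
    ("anthropic-claude-opus-4.7", True): 3.66,
    ("anthropic-claude-opus-4.7", False): 3.46,
    ("anthropic-claude-sonnet-4.6", True): 3.17,
    ("anthropic-claude-sonnet-4.6", False): 3.37,
}

# Anthropic rows of the imported slice, sampled 2026-05-15.
IMPORTED_ANTHROPIC: Dict[Key, float] = {
    ("anthropic-claude-mythos-preview", True): 9.9,
    ("anthropic-claude-mythos-preview", False): 9.55,
    ("anthropic-claude-opus-4.7", True): 3.66,
    ("anthropic-claude-opus-4.7", False): 3.46,
    ("anthropic-claude-sonnet-4.6", True): 3.17,
    ("anthropic-claude-sonnet-4.6", False): 3.37,
    ("anthropic-claude-haiku-4.5", True): 2.12,
    ("anthropic-claude-haiku-4.5", False): 2.15,
}


def compare_slices(a: Dict[Key, float], b: Dict[Key, float], tol: float = 1e-9) -> dict:
    """Split keys into matching, mismatched, and slice-exclusive groups."""
    shared = a.keys() & b.keys()
    mismatched = {k: (a[k], b[k]) for k in shared if abs(a[k] - b[k]) > tol}
    return {
        "matching": sorted(k for k in shared if k not in mismatched),
        "mismatched": mismatched,
        "only_in_a": sorted(a.keys() - b.keys()),
        "only_in_b": sorted(b.keys() - a.keys()),
    }


report = compare_slices(SELF_REPORTED, IMPORTED_ANTHROPIC)
```

Under that assumption, the six shared Anthropic rows carry identical mean scores in both slices. Claude Opus 4.8 appears only in the self-reported slice and Claude Haiku 4.5 only in the imported one.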