ExploitBench v8-bench | BenchmarkList

Metadata

ID: exploitbench_v8
Category: Cybersecurity
Release: 2026-05-13

Metrics

Mean score, Mean capability, T1 full control envs, T2 generic primitive envs, T3 target primitive envs, T4 reproduction envs, T5 coverage envs, No capability envs (lower is better), Environments, Episodes, Spend (lower is better)

Latest Results

Showing 2 latest source slices.

Rank	Subject	Mean score	Model Match	Provenance	Sampled
1	Claude Mythos Preview (AutoNudge)	9.9 points	Claude Mythos Preview anthropic-claude-mythos-preview	Self-reported	2026-05-28
2	Claude Mythos Preview (plain)	9.55 points	Claude Mythos Preview anthropic-claude-mythos-preview	Self-reported	2026-05-28
3	Claude Opus 4.8 (AutoNudge)	5.45 points	Claude Opus 4.8 anthropic-claude-opus-4.8	Self-reported	2026-05-28
4	Claude Opus 4.8 (plain)	5.02 points	Claude Opus 4.8 anthropic-claude-opus-4.8	Self-reported	2026-05-28
5	Claude Opus 4.7 (AutoNudge)	3.66 points	Claude Opus 4.7 anthropic-claude-opus-4.7	Self-reported	2026-05-28
6	Claude Opus 4.7 (plain)	3.46 points	Claude Opus 4.7 anthropic-claude-opus-4.7	Self-reported	2026-05-28
7	Claude Sonnet 4.6 (plain)	3.37 points	Claude Sonnet 4.6 anthropic-claude-sonnet-4.6	Self-reported	2026-05-28
8	Claude Sonnet 4.6 (AutoNudge)	3.17 points	Claude Sonnet 4.6 anthropic-claude-sonnet-4.6	Self-reported	2026-05-28

Rank	Subject	Mean score	Model Match	Provenance	Sampled
1	Claude Mythos Preview (nudged)	9.9 points	Claude Mythos Preview anthropic-claude-mythos-preview	Imported	2026-05-15
2	Claude Mythos Preview	9.55 points	Claude Mythos Preview anthropic-claude-mythos-preview	Imported	2026-05-15
3	GPT 5.5 (Codex) (nudged)	5.51 points	GPT-5.5 openai-gpt-5.5	Imported	2026-05-15
4	GPT 5.5 (nudged)	4.44 points	GPT-5.5 openai-gpt-5.5	Imported	2026-05-15
5	GPT 5.5 (Codex)	4.3 points	GPT-5.5 openai-gpt-5.5	Imported	2026-05-15
6	GPT 5.5	3.76 points	GPT-5.5 openai-gpt-5.5	Imported	2026-05-15
7	Claude Opus 4.7 (nudged)	3.66 points	Claude Opus 4.7 anthropic-claude-opus-4.7	Imported	2026-05-15
8	Gemini 3.1 Pro Preview	3.67 points	Gemini 3.1 Pro Preview google-gemini-3.1-pro-preview	Imported	2026-05-15
9	Claude Opus 4.7	3.46 points	Claude Opus 4.7 anthropic-claude-opus-4.7	Imported	2026-05-15
10	Claude Sonnet 4.6	3.37 points	Claude Sonnet 4.6 anthropic-claude-sonnet-4.6	Imported	2026-05-15
11	Claude Sonnet 4.6 (nudged)	3.17 points	Claude Sonnet 4.6 anthropic-claude-sonnet-4.6	Imported	2026-05-15
12	Kimi K2.6 (nudged)	2.63 points	KIMI MoonshotAI: Kimi K2.6 moonshotai-kimi-k2.6	Imported	2026-05-15
13	GLM 5.1 (nudged)	2.62 points	GLM GLM 5.1 z-ai-glm-5.1	Imported	2026-05-15
14	Kimi K2.6	2.44 points	KIMI MoonshotAI: Kimi K2.6 moonshotai-kimi-k2.6	Imported	2026-05-15
15	GLM 5.1	2.56 points	GLM GLM 5.1 z-ai-glm-5.1	Imported	2026-05-15
16	Gemini 3.1 Pro Preview (nudged)	3.17 points	Gemini 3.1 Pro Preview google-gemini-3.1-pro-preview	Imported	2026-05-15
17	Claude Haiku 4.5 (nudged)	2.12 points	Claude Haiku 4.5 anthropic-claude-haiku-4.5	Imported	2026-05-15
18	Claude Haiku 4.5	2.15 points	Claude Haiku 4.5 anthropic-claude-haiku-4.5	Imported	2026-05-15
19	MiniMax M2.7	2.07 points	MiniMax M2.7 minimax-minimax-m2.7	Imported	2026-05-15
20	MiniMax M2.7 (nudged)	2.06 points	MiniMax M2.7 minimax-minimax-m2.7	Imported	2026-05-15
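
The table lists most models twice, once plain and once with AutoNudge, so the effect of nudging can be read off as a difference of mean scores. A minimal sketch of that arithmetic, using the four pairs from the 2026-05-28 slice; the dictionary layout and the helper name nudge_delta are illustrative assumptions, not part of the benchmark's tooling:

```python
# Mean scores copied by hand from the 2026-05-28 slice of the table above.
# The structure of this dictionary is an assumption made for illustration.
scores = {
    "Claude Mythos Preview": {"plain": 9.55, "AutoNudge": 9.90},
    "Claude Opus 4.8": {"plain": 5.02, "AutoNudge": 5.45},
    "Claude Opus 4.7": {"plain": 3.46, "AutoNudge": 3.66},
    "Claude Sonnet 4.6": {"plain": 3.37, "AutoNudge": 3.17},
}


def nudge_delta(entry):
    """Return the AutoNudge mean score minus the plain mean score.

    Hypothetical helper; the result is rounded to 2 decimals to match
    the precision of the published scores.
    """
    return round(entry["AutoNudge"] - entry["plain"], 2)


deltas = {model: nudge_delta(entry) for model, entry in scores.items()}

# Print one signed difference per model, in points.
for model, delta in deltas.items():
    print(f"{model}: {delta:+.2f} points")
```

In this slice the difference is positive for every pair except Claude Sonnet 4.6, whose AutoNudge run scores below its plain run.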
