ProofBench

Automated theorem proving benchmark

Rows: 34
Primary metric: score
Sampled: 2026-05-28

Metadata

Metrics

Score (higher is better), Std. error (lower is better), Latency (lower is better), Cost per test (lower is better)
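
For readers unfamiliar with the Std. error metric, the sketch below shows one common way a standard error is attached to an accuracy score: the binomial formula sqrt(p * (1 - p) / n). This is an assumption for illustration only; the page does not say how the benchmark computes its standard error, and both the function name `binomial_std_error` and the test count of 100 are hypothetical.

```python
import math


def binomial_std_error(accuracy: float, n_tests: int) -> float:
    """Standard error of a proportion: sqrt(p * (1 - p) / n).

    Hypothetical helper; not taken from the benchmark's own tooling.
    """
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be a fraction in [0, 1]")
    return math.sqrt(accuracy * (1.0 - accuracy) / n_tests)


# Assumed example: 50% accuracy over a hypothetical 100 test problems.
se = binomial_std_error(0.50, 100)
print(f"{se:.3f}")  # 0.050
```

Under this assumption, two scores that differ by less than roughly two standard errors would be hard to tell apart, which is why the metric is worth reading alongside the score.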

Latest Results

Full leaderboard rows decoded from the Vals.ai benchmark detail page. The primary score is the overall accuracy percentage.

Rank | Subject | Score | Model Match | Provenance | Sampled
1 | Aristotle | 71% | - | Imported | 2026-05-28
2 | Claude Opus 4.8 | 69% | Claude Opus 4.8 (anthropic-claude-opus-4.8) | Imported | 2026-05-28
3 | GPT 5.4 2026-03-05 | 56% | GPT-5.4 (openai-gpt-5.4) | Imported | 2026-05-28
4 | Claude Opus 4.7 | 54% | Claude Opus 4.7 (anthropic-claude-opus-4.7) | Imported | 2026-05-28
5 | Claude Opus 4.6 Thinking | 50% | Claude Opus 4.6 (anthropic-claude-opus-4.6) | Imported | 2026-05-28
6 | GPT 5.5 | 50% | GPT-5.5 (openai-gpt-5.5) | Imported | 2026-05-28
7 | Claude Sonnet 4.6 | 45% | Claude Sonnet 4.6 (anthropic-claude-sonnet-4.6) | Imported | 2026-05-28
8 | Claude Opus 4.5 20251101 Thinking | 36% | Claude Opus 4.5 (anthropic-claude-opus-4.5) | Imported | 2026-05-28
9 | Gemini 3.5 Flash | 29% | Gemini 3.5 Flash (google-gemini-3.5-flash) | Imported | 2026-05-28
10 | Qwen 3.7 Max | 26% | Qwen3.7 Max (qwen-qwen3.7-max) | Imported | 2026-05-28
11 | Gemini 3.1 Pro Preview | 26% | Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview) | Imported | 2026-05-28
12 | GLM 5.1 Thinking | 22.222% | GLM 5.1 (z-ai-glm-5.1) | Imported | 2026-05-28
13 | GPT 5.4 Mini 2026-03-17 | 21% | GPT-5.4 Mini (openai-gpt-5.4-mini) | Imported | 2026-05-28
14 | Gemini 3 Pro Preview | 20% | Gemini 3 (google-gemini-3) | Imported | 2026-05-28
15 | Claude Sonnet 4.5 20250929 Thinking | 19% | Claude Sonnet 4.5 (anthropic-claude-sonnet-4.5) | Imported | 2026-05-28
16 | GPT 5 2025-08-07 | 18% | GPT-5 (openai-gpt-5) | Imported | 2026-05-28
17 | Muse Spark | 17% | - | Imported | 2026-05-28
18 | Kimi K2.6 Thinking | 16% | MoonshotAI: Kimi K2.6 (moonshotai-kimi-k2.6) | Imported | 2026-05-28
19 | Gemini 3 Flash Preview | 15% | Gemini 3 Flash Preview (google-gemini-3-flash-preview) | Imported | 2026-05-28
20 | GPT 5.2 2025-12-11 | 15% | GPT-5.2 (openai-gpt-5.2) | Imported | 2026-05-28
21 | Grok 4.20 0309 Reasoning | 14% | Grok 4.20 (x-ai-grok-4.20) | Imported | 2026-05-28
22 | GPT 5 Nano 2025-08-07 | 12% | GPT-5 Nano (openai-gpt-5-nano) | Imported | 2026-05-28
23 | Grok 4.3 | 11% | Grok 4.3 (x-ai-grok-4.3) | Imported | 2026-05-28
24 | DeepSeek V4 Pro | 10% | DeepSeek V4 Pro (deepseek-deepseek-v4-pro) | Imported | 2026-05-28
25 | GPT 5 Mini 2025-08-07 | 9% | GPT-5 Mini (openai-gpt-5-mini) | Imported | 2026-05-28
26 | GPT 5.1 Codex Max | 9% | GPT-5.1-Codex-Max (openai-gpt-5.1-codex-max) | Imported | 2026-05-28
27 | Qwen 3.6 27B | 8% | Qwen3.6 27B (qwen-qwen3.6-27b) | Imported | 2026-05-28
28 | DeepSeek V3P2 | 8% | - | Imported | 2026-05-28
29 | GLM 4.7 | 6% | GLM 4.7 (z-ai-glm-4.7) | Imported | 2026-05-28
30 | GPT 5.4 Nano 2026-03-17 | 5% | GPT-5.4 Nano (openai-gpt-5.4-nano) | Imported | 2026-05-28
31 | DeepSeek V3P2 Thinking | 4% | - | Imported | 2026-05-28
32 | Grok 4.1 Fast Reasoning | 4% | Grok 4.1 Fast (x-ai-grok-4.1-fast) | Imported | 2026-05-28
33 | MiniMax M2.5 Lightning | 4% | - | Imported | 2026-05-28
34 | MiniMax M2.7 | 3% | MiniMax M2.7 (minimax-minimax-m2.7) | Imported | 2026-05-28