ProofBench
Automated theorem proving benchmark
Rows: 34
Primary metric: score
Sampled: 2026-05-28
Metadata
Metrics
Score, Std. error (lower is better), Latency (lower is better), Cost per test (lower is better)
| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Aristotle | 71% | — | Imported | 2026-05-28 |
| 2 | Claude Opus 4.8 | 69% | Claude Opus 4.8 anthropic-claude-opus-4.8 | Imported | 2026-05-28 |
| 3 | GPT 5.4 2026-03-05 | 56% | GPT-5.4 openai-gpt-5.4 | Imported | 2026-05-28 |
| 4 | Claude Opus 4.7 | 54% | Claude Opus 4.7 anthropic-claude-opus-4.7 | Imported | 2026-05-28 |
| 5 | Claude Opus 4.6 Thinking | 50% | Claude Opus 4.6 anthropic-claude-opus-4.6 | Imported | 2026-05-28 |
| 6 | GPT 5.5 | 50% | GPT-5.5 openai-gpt-5.5 | Imported | 2026-05-28 |
| 7 | Claude Sonnet 4.6 | 45% | Claude Sonnet 4.6 anthropic-claude-sonnet-4.6 | Imported | 2026-05-28 |
| 8 | Claude Opus 4.5 20251101 Thinking | 36% | Claude Opus 4.5 anthropic-claude-opus-4.5 | Imported | 2026-05-28 |
| 9 | Gemini 3.5 Flash | 29% | Gemini 3.5 Flash google-gemini-3.5-flash | Imported | 2026-05-28 |
| 10 | Qwen 3.7 Max | 26% | Qwen3.7 Max qwen-qwen3.7-max | Imported | 2026-05-28 |
| 11 | Gemini 3.1 Pro Preview | 26% | Gemini 3.1 Pro Preview google-gemini-3.1-pro-preview | Imported | 2026-05-28 |
| 12 | GLM 5.1 Thinking | 22.222% | GLM 5.1 z-ai-glm-5.1 | Imported | 2026-05-28 |
| 13 | GPT 5.4 Mini 2026-03-17 | 21% | GPT-5.4 Mini openai-gpt-5.4-mini | Imported | 2026-05-28 |
| 14 | Gemini 3 Pro Preview | 20% | Gemini 3 google-gemini-3 | Imported | 2026-05-28 |
| 15 | Claude Sonnet 4.5 20250929 Thinking | 19% | Claude Sonnet 4.5 anthropic-claude-sonnet-4.5 | Imported | 2026-05-28 |
| 16 | GPT 5 2025-08-07 | 18% | GPT-5 openai-gpt-5 | Imported | 2026-05-28 |
| 17 | Muse Spark | 17% | — | Imported | 2026-05-28 |
| 18 | Kimi K2.6 Thinking | 16% | MoonshotAI: Kimi K2.6 moonshotai-kimi-k2.6 | Imported | 2026-05-28 |
| 19 | Gemini 3 Flash Preview | 15% | Gemini 3 Flash Preview google-gemini-3-flash-preview | Imported | 2026-05-28 |
| 20 | GPT 5.2 2025-12-11 | 15% | GPT-5.2 openai-gpt-5.2 | Imported | 2026-05-28 |
| 21 | Grok 4.20 0309 Reasoning | 14% | Grok 4.20 x-ai-grok-4.20 | Imported | 2026-05-28 |
| 22 | GPT 5 Nano 2025-08-07 | 12% | GPT-5 Nano openai-gpt-5-nano | Imported | 2026-05-28 |
| 23 | Grok 4.3 | 11% | Grok 4.3 x-ai-grok-4.3 | Imported | 2026-05-28 |
| 24 | DeepSeek V4 Pro | 10% | DeepSeek V4 Pro deepseek-deepseek-v4-pro | Imported | 2026-05-28 |
| 25 | GPT 5 Mini 2025-08-07 | 9% | GPT-5 Mini openai-gpt-5-mini | Imported | 2026-05-28 |
| 26 | GPT 5.1 Codex Max | 9% | GPT-5.1-Codex-Max openai-gpt-5.1-codex-max | Imported | 2026-05-28 |
| 27 | Qwen 3.6 27B | 8% | Qwen3.6 27B qwen-qwen3.6-27b | Imported | 2026-05-28 |
| 28 | DeepSeek V3P2 | 8% | — | Imported | 2026-05-28 |
| 29 | GLM 4.7 | 6% | GLM 4.7 z-ai-glm-4.7 | Imported | 2026-05-28 |
| 30 | GPT 5.4 Nano 2026-03-17 | 5% | GPT-5.4 Nano openai-gpt-5.4-nano | Imported | 2026-05-28 |
| 31 | DeepSeek V3P2 Thinking | 4% | — | Imported | 2026-05-28 |
| 32 | Grok 4.1 Fast Reasoning | 4% | Grok 4.1 Fast x-ai-grok-4.1-fast | Imported | 2026-05-28 |
| 33 | MiniMax M2.5 Lightning | 4% | — | Imported | 2026-05-28 |
| 34 | MiniMax M2.7 | 3% | MiniMax M2.7 minimax-minimax-m2.7 | Imported | 2026-05-28 |