LiveMathematicianBench
Live benchmark for research-level theorem comprehension, with monthly multiple-choice questions derived from newly published arXiv mathematics papers.
9 rows
Primary metric: accuracy
Sampled: 2026-05-28
Metrics
Accuracy, Correct, Total, and Output Tokens / Task (for output tokens, lower is better)
| Rank | Subject | Accuracy | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Gemini 3.1 Pro Preview (high) | 43.5% | Gemini 3.1 Pro Preview (`google-gemini-3.1-pro-preview`) | Imported | 2026-05-28 |
| 2 | GPT-5.4 (high) | 41.8% | GPT-5.4 (`openai-gpt-5.4`) | Imported | 2026-05-28 |
| 3 | GPT-5.4 (medium) | 41.2% | GPT-5.4 (`openai-gpt-5.4`) | Imported | 2026-05-28 |
| 4 | Qwen3.5-397B-A17B (enabled) | 35.6% | Qwen3.5 397B A17B (`qwen-qwen3.5-397b-a17b`) | Imported | 2026-05-28 |
| 5 | Kimi-K2.5 (enabled) | 35.0% | MoonshotAI: Kimi K2.5 (`moonshotai-kimi-k2.5`) | Imported | 2026-05-28 |
| 6 | GPT-OSS-120B (high) | 28.8% | gpt-oss-120b (`openai-gpt-oss-120b`) | Imported | 2026-05-28 |
| 7 | Grok-4.1 Fast Reasoning (high) | 25.4% | Grok 4.1 Fast (`x-ai-grok-4.1-fast`) | Imported | 2026-05-28 |
| 8 | MiniMax-M2.5 (enabled) | 22.0% | MiniMax M2.5 (`minimax-minimax-m2.5`) | Imported | 2026-05-28 |
| 9 | Random | 20.0% | — | Imported | 2026-05-28 |
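
The listed metrics (Accuracy, Correct, Total, Output Tokens / Task) are presumably related in the usual way; the page does not define them, so the following is only a minimal sketch under that assumption. The `RunResult` type, the function names, and the example counts are hypothetical and are not taken from the leaderboard. The 20.0% Random row would be consistent with five equally likely answer options, though the number of options is not stated here.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Hypothetical container for one model's raw counts (not from the page)."""
    correct: int        # number of questions answered correctly
    total: int          # number of questions attempted
    output_tokens: int  # output tokens summed over all questions


def accuracy(r: RunResult) -> float:
    """Assumed definition: fraction of questions answered correctly."""
    if r.total <= 0:
        raise ValueError("total must be positive")
    return r.correct / r.total


def tokens_per_task(r: RunResult) -> float:
    """Assumed definition: mean output tokens per question (lower is better)."""
    if r.total <= 0:
        raise ValueError("total must be positive")
    return r.output_tokens / r.total


def random_baseline(num_choices: int) -> float:
    """Expected accuracy of uniform guessing over num_choices options."""
    if num_choices <= 0:
        raise ValueError("num_choices must be positive")
    return 1 / num_choices


# Illustrative numbers only; these are NOT leaderboard data.
example = RunResult(correct=40, total=100, output_tokens=250_000)
lift_over_random = accuracy(example) - random_baseline(5)
```

Comparing each model's accuracy with the random baseline, rather than with zero, gives a clearer sense of how much signal a score carries on a multiple-choice benchmark.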