LiveMathematicianBench

Live benchmark for research-level theorem comprehension, with monthly multiple-choice questions derived from newly published arXiv mathematics papers.

Rows: 9
Primary metric: accuracy
Sampled: 2026-05-28

Metadata

Metrics

Accuracy, Correct, Total, Output Tokens / Task (lower is better)
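The relationship between these metrics can be sketched as follows, assuming Accuracy is simply Correct divided by Total, expressed as a percentage. The function name `accuracy_pct` and the counts used are hypothetical illustrations, not values taken from the benchmark.

```python
def accuracy_pct(correct: int, total: int) -> float:
    """Return accuracy as a percentage, rounded to one decimal place.

    Assumes accuracy = correct / total; the benchmark's exact
    rounding rule is not stated in the source.
    """
    if total <= 0:
        raise ValueError("total must be a positive number of questions")
    if not 0 <= correct <= total:
        raise ValueError("correct must lie between 0 and total")
    return round(100 * correct / total, 1)


# Hypothetical counts, for illustration only.
example = accuracy_pct(correct=3, total=4)
print(f"{example}%")
```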

Latest Results

Rows are imported from the JavaScript data embedded in the official LiveMathematicianBench homepage. The random-guessing baseline is included as a non-model row for reference.

| Rank | Subject | Accuracy | Model Match | Provenance | Sampled |
|------|---------|----------|-------------|------------|---------|
| 1 | Gemini 3.1 Pro Preview (high) | 43.5% | Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview) | Imported | 2026-05-28 |
| 2 | GPT-5.4 (high) | 41.8% | GPT-5.4 (openai-gpt-5.4) | Imported | 2026-05-28 |
| 3 | GPT-5.4 (medium) | 41.2% | GPT-5.4 (openai-gpt-5.4) | Imported | 2026-05-28 |
| 4 | Qwen3.5-397B-A17B (enabled) | 35.6% | Qwen3.5 397B A17B (qwen-qwen3.5-397b-a17b) | Imported | 2026-05-28 |
| 5 | Kimi-K2.5 (enabled) | 35.0% | MoonshotAI: Kimi K2.5 (moonshotai-kimi-k2.5) | Imported | 2026-05-28 |
| 6 | GPT-OSS-120B (high) | 28.8% | gpt-oss-120b (openai-gpt-oss-120b) | Imported | 2026-05-28 |
| 7 | Grok-4.1 Fast Reasoning (high) | 25.4% | Grok 4.1 Fast (x-ai-grok-4.1-fast) | Imported | 2026-05-28 |
| 8 | MiniMax-M2.5 (enabled) | 22.0% | MiniMax M2.5 (minimax-minimax-m2.5) | Imported | 2026-05-28 |
| 9 | Random | 20.0% | | Imported | 2026-05-28 |
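
One way to read the accuracies listed above is as a margin over the random-guessing row. The sketch below subtracts the 20.0% baseline from each model's accuracy, in percentage points. This margin is an illustrative derived quantity, not a metric the benchmark reports, and the names `RANDOM_BASELINE`, `ACCURACY` and `margin_over_random` are hypothetical.

```python
# Accuracy values (in percent) copied from the results listed above.
RANDOM_BASELINE = 20.0

ACCURACY = {
    "Gemini 3.1 Pro Preview (high)": 43.5,
    "GPT-5.4 (high)": 41.8,
    "GPT-5.4 (medium)": 41.2,
    "Qwen3.5-397B-A17B (enabled)": 35.6,
    "Kimi-K2.5 (enabled)": 35.0,
    "GPT-OSS-120B (high)": 28.8,
    "Grok-4.1 Fast Reasoning (high)": 25.4,
    "MiniMax-M2.5 (enabled)": 22.0,
}


def margin_over_random(accuracy: float, baseline: float = RANDOM_BASELINE) -> float:
    """Percentage points by which an accuracy exceeds the random baseline."""
    return round(accuracy - baseline, 1)


margins = {name: margin_over_random(acc) for name, acc in ACCURACY.items()}

# Print models from largest to smallest margin.
for name, margin in sorted(margins.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: +{margin} points over random")
```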