MATH 500
Academic math benchmark on probability, algebra, and trigonometry
60 rows
Primary metric: score
Sampled: 2026-01-09
Metadata
Metrics
Score, Std. error (lower is better), Latency (lower is better), Cost per test (lower is better)
| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Gemini 3 Pro Preview | 96.4% | Gemini 3 google-gemini-3 | Imported | 2026-01-09 |
| 2 | Grok 4 0709 | 96.2% | Grok 4 x-ai-grok-4 | Imported | 2026-01-09 |
| 3 | GPT 5 2025-08-07 | 96% | GPT-5 openai-gpt-5 | Imported | 2026-01-09 |
| 4 | Claude Opus 4.1 20250805 Thinking | 95.4% | Claude Opus 4.1 anthropic-claude-opus-4.1 | Imported | 2026-01-09 |
| 5 | Gemini 2.5 Pro Exp 03 25 | 95.2% | — | Imported | 2026-01-09 |
| 6 | GPT Oss 120B | 94.8% | gpt-oss-120b openai-gpt-oss-120b | Imported | 2026-01-09 |
| 7 | GPT 5 Mini 2025-08-07 | 94.8% | GPT-5 Mini openai-gpt-5-mini | Imported | 2026-01-09 |
| 8 | Qwen 3 235B A22b | 94.6% | Qwen3 235B A22B qwen-qwen3-235b-a22b | Imported | 2026-01-09 |
| 9 | O3 2025-04-16 | 94.6% | o3 openai-o3 | Imported | 2026-01-09 |
| 10 | GPT Oss 20B | 94.2% | gpt-oss-20b openai-gpt-oss-20b | Imported | 2026-01-09 |
| 11 | Grok 3 Mini Fast High Reasoning | 94.2% | — | Imported | 2026-01-09 |
| 12 | O4 Mini 2025-04-16 | 94.2% | o4 Mini openai-o4-mini | Imported | 2026-01-09 |
| 13 | Kimi K2 Instruct | 94.2% | MoonshotAI: Kimi K2 0711 moonshotai-kimi-k2 | Imported | 2026-01-09 |
| 14 | GLM 4.5 | 94% | GLM 4.5 z-ai-glm-4.5 | Imported | 2026-01-09 |
| 15 | Claude Sonnet 4 20250514 Thinking | 93.8% | — | Imported | 2026-01-09 |
| 16 | GPT 5 Nano 2025-08-07 | 93.8% | GPT-5 Nano openai-gpt-5-nano | Imported | 2026-01-09 |
| 17 | Claude Opus 4.1 20250805 | 93% | Claude Opus 4.1 anthropic-claude-opus-4.1 | Imported | 2026-01-09 |
| 18 | DeepSeek R1 | 92.2% | R1 deepseek-r1 | Imported | 2026-01-09 |
| 19 | Gemini 2.5 Flash Preview 04 17 Thinking | 91.8% | — | Imported | 2026-01-09 |
| 20 | O3 Mini 2025-01-31 | 91.8% | o3-mini openai-o3-mini | Imported | 2026-01-09 |
| 21 | Claude 3.7 Sonnet 20250219 Thinking | 91.6% | — | Imported | 2026-01-09 |
| 22 | Gemini 2.5 Flash Preview 04 17 | 91.6% | — | Imported | 2026-01-09 |
| 23 | Llama 3.3 Nemotron Super 49B V1 42e84561 Thinking | 91.4% | — | Imported | 2026-01-09 |
| 24 | Claude Opus 4 20250514 | 90.4% | Claude Opus 4 anthropic-claude-opus-4 | Imported | 2026-01-09 |
| 25 | O1 2024-12-17 | 90.4% | o1 openai-o1 | Imported | 2026-01-09 |
| 26 | Claude Sonnet 4 20250514 | 90.323% | Claude Sonnet 4 anthropic-claude-sonnet-4 | Imported | 2026-01-09 |
| 27 | Grok 3 | 89.8% | Grok 3 x-ai-grok-3 | Imported | 2026-01-09 |
| 28 | Gemini 2.0 Flash Exp | 89% | — | Imported | 2026-01-09 |
| 29 | MiniMax M2.1 | 89% | MiniMax M2.1 minimax-minimax-m2.1 | Imported | 2026-01-09 |
| 30 | DeepSeek V3 0324 | 88.6% | DeepSeek V3 0324 deepseek-deepseek-chat-v3-0324 | Imported | 2026-01-09 |
| 31 | Gemini 2.0 Flash 001 | 88% | Gemini 2.0 Flash google-gemini-2.0-flash | Imported | 2026-01-09 |
| 32 | GPT 4.1 Mini 2025-04-14 | 88% | GPT-4.1 Mini openai-gpt-4.1-mini | Imported | 2026-01-09 |
| 33 | GPT 4.1 2025-04-14 | 87.2% | GPT-4.1 openai-gpt-4.1 | Imported | 2026-01-09 |
| 34 | Mistral Medium 2505 | 87% | — | Imported | 2026-01-09 |
| 35 | Llama4 Maverick Instruct Basic | 85.2% | — | Imported | 2026-01-09 |
| 36 | Gemini 2.0 Flash Thinking Exp 01 21 | 84.6% | — | Imported | 2026-01-09 |
| 37 | Gemini 1.5 Pro 002 | 82.8% | — | Imported | 2026-01-09 |
| 38 | DeepSeek V3 | 80.4% | DeepSeek V3 deepseek-deepseek-chat | Imported | 2026-01-09 |
| 39 | GPT 4.1 Nano 2025-04-14 | 80.2% | GPT-4.1 Nano openai-gpt-4.1-nano | Imported | 2026-01-09 |
| 40 | Llama 4 Scout 17B 16E Instruct | 79.2% | Llama 4 Scout meta-llama-llama-4-scout | Imported | 2026-01-09 |
| 41 | Gemini 1.5 Flash 002 | 78.8% | — | Imported | 2026-01-09 |
| 42 | Grok 2 1212 | 78.4% | — | Imported | 2026-01-09 |
| 43 | Claude 3.7 Sonnet 20250219 | 76.8% | Claude 3.7 Sonnet anthropic-claude-3.7-sonnet | Imported | 2026-01-09 |
| 44 | Command A 03 2025 | 76.2% | Command A cohere-command-a | Imported | 2026-01-09 |
| 45 | GPT 4O 2024-08-06 | 75.2% | GPT-4o (2024-08-06) openai-gpt-4o-2024-08-06 | Imported | 2026-01-09 |
| 46 | Mistral Large 2411 | 74.4% | Mistral Large 2411 mistralai-mistral-large-2411 | Imported | 2026-01-09 |
| 47 | GPT 4O 2024-11-20 | 74% | GPT-4o (2024-11-20) openai-gpt-4o-2024-11-20 | Imported | 2026-01-09 |
| 48 | Llama 3.3 70B Instruct Turbo | 73.4% | — | Imported | 2026-01-09 |
| 49 | GPT 4O Mini 2024-07-18 | 72.6% | GPT-4o-mini (2024-07-18) openai-gpt-4o-mini-2024-07-18 | Imported | 2026-01-09 |
| 50 | Claude 3.5 Sonnet 20241022 | 72.4% | Claude 3.5 Sonnet anthropic-claude-3.5-sonnet | Imported | 2026-01-09 |
| 51 | Meta Llama 3.1 405B Instruct Turbo | 71.4% | — | Imported | 2026-01-09 |
| 52 | Llama 3.3 Nemotron Super 49B V1 42e84561 | 71.2% | — | Imported | 2026-01-09 |
| 53 | Mistral Small 2402 | 70.6% | — | Imported | 2026-01-09 |
| 54 | Grok 3 Mini Fast Low Reasoning | 70.2% | — | Imported | 2026-01-09 |
| 55 | Mistral Small 2503 | 68.4% | — | Imported | 2026-01-09 |
| 56 | Meta Llama 3.1 70B Instruct Turbo | 65% | — | Imported | 2026-01-09 |
| 57 | Claude 3.5 Haiku 20241022 | 64.2% | — | Imported | 2026-01-09 |
| 58 | Jamba Large 1.6 | 54.8% | — | Imported | 2026-01-09 |
| 59 | Meta Llama 3.1 8B Instruct Turbo | 44.4% | — | Imported | 2026-01-09 |
| 60 | Jamba Mini 1.6 | 25.4% | — | Imported | 2026-01-09 |
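
Many adjacent ranks in the table differ by only 0.2 percentage points, so it can help to estimate how much of a gap is sampling noise. The sketch below is illustrative only: it assumes each score is a pass rate over 500 independent problems (inferred from the benchmark name, not stated on this page) and uses a plain binomial standard error, which may differ from how the listed Std. error metric is actually computed. The function names are hypothetical.

```python
import math


def binomial_std_error(score_pct: float, n_problems: int = 500) -> float:
    """Standard error of an accuracy score, in percentage points.

    Assumption: every problem is an independent pass/fail trial and the
    benchmark has n_problems items (500 is a guess based on the name).
    """
    p = score_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_problems)


def gap_in_std_errors(score_a: float, score_b: float, n_problems: int = 500) -> float:
    """Absolute score gap divided by the combined standard error.

    Treats the two runs as unpaired samples, which is conservative when
    both models were scored on the same problems.
    """
    se_a = binomial_std_error(score_a, n_problems)
    se_b = binomial_std_error(score_b, n_problems)
    combined = math.hypot(se_a, se_b)
    if combined == 0.0:
        # Both scores are exactly 0% or 100%: no estimated noise.
        return 0.0 if score_a == score_b else math.inf
    return abs(score_a - score_b) / combined


if __name__ == "__main__":
    # Rank 1 vs rank 2, then rank 1 vs rank 60, using scores from the table.
    print(f"SE at 96.4%: {binomial_std_error(96.4):.2f} points")
    print(f"96.4% vs 96.2%: {gap_in_std_errors(96.4, 96.2):.2f} standard errors")
    print(f"96.4% vs 25.4%: {gap_in_std_errors(96.4, 25.4):.2f} standard errors")
```

Under these assumptions, a gap well below one combined standard error (as between ranks 1 and 2) should not be read as a meaningful difference, while the gap between the top and bottom of the table is far outside the noise.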