MGSM
A multilingual benchmark of grade-school math word problems.
75 rows
score (primary metric)
2026-01-09 (sampled)
Metadata
Metrics
Score, Std. error (lower is better), Latency (lower is better), Cost per test (lower is better)
| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.5 20251101 Thinking | 95.2% | Claude Opus 4.5 anthropic-claude-opus-4.5 | Imported | 2026-01-09 |
| 2 | Claude Opus 4.5 20251101 | 94.764% | Claude Opus 4.5 anthropic-claude-opus-4.5 | Imported | 2026-01-09 |
| 3 | Claude Opus 4.1 20250805 Thinking | 94.436% | Claude Opus 4.1 anthropic-claude-opus-4.1 | Imported | 2026-01-09 |
| 4 | Claude Sonnet 4.5 20250929 Thinking | 94.327% | Claude Sonnet 4.5 anthropic-claude-sonnet-4.5 | Imported | 2026-01-09 |
| 5 | Claude Opus 4.1 20250805 | 94.218% | Claude Opus 4.1 anthropic-claude-opus-4.1 | Imported | 2026-01-09 |
| 6 | GPT 5.2 2025-12-11 | 94% | GPT-5.2 openai-gpt-5.2 | Imported | 2026-01-09 |
| 7 | Gemini 3 Pro Preview | 93.927% | Gemini 3 google-gemini-3 | Imported | 2026-01-09 |
| 8 | Claude Opus 4 20250514 | 93.782% | Claude Opus 4 anthropic-claude-opus-4 | Imported | 2026-01-09 |
| 9 | O4 Mini 2025-04-16 | 93.418% | o4 Mini openai-o4-mini | Imported | 2026-01-09 |
| 10 | Gemini 3 Flash Preview | 93.309% | Gemini 3 Flash Preview google-gemini-3-flash-preview | Imported | 2026-01-09 |
| 11 | Claude Sonnet 4 20250514 | 93.018% | Claude Sonnet 4 anthropic-claude-sonnet-4 | Imported | 2026-01-09 |
| 12 | Claude 3 7 Sonnet 20250219 Thinking | 92.982% | — | Imported | 2026-01-09 |
| 13 | GPT 5.1 2025-11-13 | 92.982% | GPT-5.1 openai-gpt-5.1 | Imported | 2026-01-09 |
| 14 | GPT 5 2025-08-07 | 92.836% | GPT-5 openai-gpt-5 | Imported | 2026-01-09 |
| 15 | Claude 3 5 Sonnet 20241022 | 92.582% | Claude 3.5 Sonnet anthropic-claude-3.5-sonnet | Imported | 2026-01-09 |
| 16 | GPT 5 Mini 2025-08-07 | 92.582% | GPT-5 Mini openai-gpt-5-mini | Imported | 2026-01-09 |
| 17 | Qwen 3 235B A22b | 92.473% | Qwen3 235B A22B qwen-qwen3-235b-a22b | Imported | 2026-01-09 |
| 18 | Llama4 Maverick Instruct Basic | 92.436% | — | Imported | 2026-01-09 |
| 19 | Claude 3 7 Sonnet 20250219 | 92.4% | Claude 3.7 Sonnet anthropic-claude-3.7-sonnet | Imported | 2026-01-09 |
| 20 | DeepSeek R1 | 92.254% | R1 deepseek-r1 | Imported | 2026-01-09 |
| 21 | Qwen 3 Max Preview | 92.146% | — | Imported | 2026-01-09 |
| 22 | Claude Haiku 4.5 20251001 Thinking | 92.146% | Claude Haiku 4.5 anthropic-claude-haiku-4.5 | Imported | 2026-01-09 |
| 23 | DeepSeek V3 | 92.146% | DeepSeek V3 deepseek-deepseek-chat | Imported | 2026-01-09 |
| 24 | GPT Oss 120B | 92.036% | gpt-oss-120b openai-gpt-oss-120b | Imported | 2026-01-09 |
| 25 | Qwen 3 Max | 91.818% | Qwen3 Max qwen-qwen3-max | Imported | 2026-01-09 |
| 26 | O3 2025-04-16 | 91.746% | o3 openai-o3 | Imported | 2026-01-09 |
| 27 | DeepSeek V3 0324 | 91.673% | DeepSeek V3 0324 deepseek-deepseek-chat-v3-0324 | Imported | 2026-01-09 |
| 28 | Grok 3 | 91.346% | Grok 3 xaigrok-3 | Imported | 2026-01-09 |
| 29 | O3 Mini 2025-01-31 | 91.346% | o3-mini openai-o3-mini | Imported | 2026-01-09 |
| 30 | Llama 3.3 70B Instruct Turbo | 91.091% | — | Imported | 2026-01-09 |
| 31 | Kimi K2 Instruct | 90.946% | MoonshotAI: Kimi K2 0711 moonshotai-kimi-k2 | Imported | 2026-01-09 |
| 32 | Grok 4 0709 | 90.909% | Grok 4 x-ai-grok-4 | Imported | 2026-01-09 |
| 33 | Claude Sonnet 4 20250514 Thinking | 90.873% | — | Imported | 2026-01-09 |
| 34 | DeepSeek V3P2 | 90.873% | — | Imported | 2026-01-09 |
| 35 | Grok 4 Fast Reasoning | 90.873% | Grok 4 Fast x-ai-grok-4-fast | Imported | 2026-01-09 |
| 36 | Mistral Medium 2505 | 90.873% | — | Imported | 2026-01-09 |
| 37 | GLM 4.5 | 90.836% | GLM 4.5 z-ai-glm-4.5 | Imported | 2026-01-09 |
| 38 | GPT 4O 2024-08-06 | 90.691% | GPT-4o (2024-08-06) openai-gpt-4o-2024-08-06 | Imported | 2026-01-09 |
| 39 | Grok 3 Mini Fast High Reasoning | 90.436% | — | Imported | 2026-01-09 |
| 40 | Grok 3 Mini Fast Low Reasoning | 90.364% | — | Imported | 2026-01-09 |
| 41 | GPT 4O 2024-11-20 | 90.364% | GPT-4o (2024-11-20) openai-gpt-4o-2024-11-20 | Imported | 2026-01-09 |
| 42 | Kimi K2 Thinking | 90.146% | MoonshotAI: Kimi K2 Thinking moonshotai-kimi-k2-thinking | Imported | 2026-01-09 |
| 43 | Gemini 2.5 Flash Preview 09 2025 | 89.854% | — | Imported | 2026-01-09 |
| 44 | Gemini 2.5 Flash Preview 09 2025 Thinking | 89.818% | — | Imported | 2026-01-09 |
| 45 | GLM 4.6 | 89.746% | GLM 4.6 z-ai-glm-4.6 | Imported | 2026-01-09 |
| 46 | Gemini 2.5 Flash Lite Preview 09 2025 | 89.527% | Gemini 2.5 Flash Lite Preview 09-2025 google-gemini-2.5-flash-lite-preview-09-2025 | Imported | 2026-01-09 |
| 47 | Grok 4.1 Fast Reasoning | 89.527% | Grok 4.1 Fast x-ai-grok-4.1-fast | Imported | 2026-01-09 |
| 48 | GPT 5 Nano 2025-08-07 | 89.309% | GPT-5 Nano openai-gpt-5-nano | Imported | 2026-01-09 |
| 49 | O1 2024-12-17 | 89.309% | o1 openai-o1 | Imported | 2026-01-09 |
| 50 | Gemini 1.5 Pro 002 | 89.2% | — | Imported | 2026-01-09 |
| 51 | GPT Oss 20B | 89.018% | gpt-oss-20b openai-gpt-oss-20b | Imported | 2026-01-09 |
| 52 | Gemini 2.0 Flash 001 | 89.018% | Gemini 2.0 Flash google-gemini-2.0-flash | Imported | 2026-01-09 |
| 53 | Gemini 2.5 Flash Lite Preview 09 2025 Thinking | 88.4% | — | Imported | 2026-01-09 |
| 54 | GLM 4.7 | 88.182% | GLM 4.7 z-ai-glm-4.7 | Imported | 2026-01-09 |
| 55 | Grok 4 Fast Non Reasoning | 88% | Grok 4 Fast x-ai-grok-4-fast | Imported | 2026-01-09 |
| 56 | Llama 4 Scout 17B 16E Instruct | 87.964% | Llama 4 Scout meta-llama-llama-4-scout | Imported | 2026-01-09 |
| 57 | MiniMax M2.1 | 87.854% | MiniMax M2.1 minimax-minimax-m2.1 | Imported | 2026-01-09 |
| 58 | GPT 4.1 Mini 2025-04-14 | 87.782% | GPT-4.1 Mini openai-gpt-4.1-mini | Imported | 2026-01-09 |
| 59 | GPT 4.1 2025-04-14 | 87.673% | GPT-4.1 openai-gpt-4.1 | Imported | 2026-01-09 |
| 60 | Grok 4.1 Fast Non Reasoning | 87.564% | Grok 4.1 Fast x-ai-grok-4.1-fast | Imported | 2026-01-09 |
| 61 | Mistral Large 2411 | 87.236% | Mistral Large 2411 mistralai-mistral-large-2411 | Imported | 2026-01-09 |
| 62 | Gemini 1.5 Flash 002 | 86.582% | — | Imported | 2026-01-09 |
| 63 | Magistral Small 2509 | 86.254% | — | Imported | 2026-01-09 |
| 64 | GPT 4O Mini 2024-07-18 | 86.182% | GPT-4o-mini (2024-07-18) openai-gpt-4o-mini-2024-07-18 | Imported | 2026-01-09 |
| 65 | Grok 2 1212 | 86.146% | — | Imported | 2026-01-09 |
| 66 | DeepSeek V3P2 Thinking | 86.036% | — | Imported | 2026-01-09 |
| 67 | Command A 03 2025 | 85.709% | Command A cohere-command-a | Imported | 2026-01-09 |
| 68 | Mistral Large 2512 | 85.418% | Mistral: Mistral Large 3 2512 mistralai-mistral-large-2512 | Imported | 2026-01-09 |
| 69 | Claude 3 5 Haiku 20241022 | 84.618% | — | Imported | 2026-01-09 |
| 70 | Mistral Small 2503 | 84.218% | — | Imported | 2026-01-09 |
| 71 | Mistral Small 2402 | 83.964% | — | Imported | 2026-01-09 |
| 72 | Magistral Medium 2509 | 74.618% | — | Imported | 2026-01-09 |
| 73 | Jamba Large 1.6 | 71.236% | — | Imported | 2026-01-09 |
| 74 | GPT 4.1 Nano 2025-04-14 | 69.273% | GPT-4.1 Nano openai-gpt-4.1-nano | Imported | 2026-01-09 |
| 75 | Jamba Mini 1.6 | 41.709% | — | Imported | 2026-01-09 |