MMLU-Redux
MMLU-Redux is a manually re-annotated version of the MMLU benchmark that identifies and corrects errors in the original questions. By addressing these dataset quality issues, it provides more reliable evaluation metrics for language models.
51 rows
Primary metric: score
Sampled: 2026-05-28
Metadata
Metrics: Score, Normalized Score
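Showing the 2 latest source slices (sampled 2026-05-28 and 2026-05-06). Ranks restart within each slice; scores appear as percentages in the 2026-05-28 slice and as fractions in the 2026-05-06 slice.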
| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Kimi K2.6 Thinking | 95.3% | MoonshotAI: Kimi K2.6 moonshotai-kimi-k2.6 | Self-reported | 2026-05-28 |
| 2 | Claude Opus 4.6 Max | 95.2% | Claude Opus 4.6 anthropic-claude-opus-4.6 | Self-reported | 2026-05-28 |
| 3 | Qwen3.7 Max | 95% | Qwen3.7 Max qwen-qwen3.7-max | Self-reported | 2026-05-28 |
| 4 | DeepSeek V4 Pro Max | 94.8% | DeepSeek V4 Pro deepseek-deepseek-v4-pro | Self-reported | 2026-05-28 |
| 5 | Qwen3.6 Plus | 94.5% | Qwen3.6 Plus qwen-qwen3.6-plus | Self-reported | 2026-05-28 |
| 6 | GLM-5.1 Thinking | 94.3% | GLM 5.1 z-ai-glm-5.1 | Self-reported | 2026-05-28 |
| 1 | Qwen3.5-397B-A17B | 0.95 | Qwen3.5 397B A17B qwen-qwen3.5-397b-a17b | Self-reported | 2026-05-06 |
| 2 | Qwen3.6 Plus | 0.94 | Qwen3.6 Plus qwen-qwen3.6-plus | Self-reported | 2026-05-06 |
| 3 | Kimi K2-Thinking-0905 | 0.94 | MoonshotAI: Kimi K2 Thinking moonshotai-kimi-k2-thinking | Self-reported | 2026-05-06 |
| 4 | Qwen3.5-122B-A10B | 0.94 | Qwen3.5-122B-A10B qwen-qwen3.5-122b-a10b | Self-reported | 2026-05-06 |
| 5 | Qwen3-235B-A22B-Thinking-2507 | 0.94 | Qwen3 235B A22B Thinking 2507 qwen-qwen3-235b-a22b-thinking-2507 | Self-reported | 2026-05-06 |
| 6 | Qwen3 VL 235B A22B Thinking | 0.94 | Qwen3 VL 235B A22B Thinking qwen-qwen3-vl-235b-a22b-thinking | Self-reported | 2026-05-06 |
| 7 | Qwen3.6-27B | 0.94 | Qwen3.6 27B qwen-qwen3.6-27b | Self-reported | 2026-05-06 |
| 8 | DeepSeek-R1-0528 | 0.93 | R1 0528 deepseek-deepseek-r1-0528 | Self-reported | 2026-05-06 |
| 9 | Qwen3.6-35B-A3B | 0.93 | Qwen3.6 35B A3B qwen-qwen3.6-35b-a3b | Self-reported | 2026-05-06 |
| 9 | Qwen3.5-35B-A3B | 0.93 | Qwen3.5-35B-A3B qwen-qwen3.5-35b-a3b | Self-reported | 2026-05-06 |
| 11 | Qwen3.5-27B | 0.93 | Qwen3.5-27B qwen-qwen3.5-27b | Self-reported | 2026-05-06 |
| 12 | Qwen3-235B-A22B-Instruct-2507 | 0.93 | Qwen3 235B A22B Instruct 2507 qwen-qwen3-235b-a22b-2507 | Self-reported | 2026-05-06 |
| 13 | Kimi K2-Instruct-0905 | 0.93 | MoonshotAI: Kimi K2 0905 moonshotai-kimi-k2-0905 | Self-reported | 2026-05-06 |
| 13 | Kimi K2 Instruct | 0.93 | MoonshotAI: Kimi K2 0711 moonshotai-kimi-k2 | Self-reported | 2026-05-06 |
| 15 | Qwen3-Next-80B-A3B-Thinking | 0.93 | Qwen3 Next 80B A3B Thinking qwen-qwen3-next-80b-a3b-thinking | Self-reported | 2026-05-06 |
| 16 | Qwen3 VL 235B A22B Instruct | 0.92 | Qwen3 VL 235B A22B Instruct qwen-qwen3-vl-235b-a22b-instruct | Self-reported | 2026-05-06 |
| 17 | Qwen3 VL 32B Thinking | 0.92 | — | Self-reported | 2026-05-06 |
| 18 | DeepSeek-V3.1 | 0.92 | DeepSeek V3.1 deepseek-deepseek-chat-v3.1 | Self-reported | 2026-05-06 |
| 19 | Qwen3.5-9B | 0.91 | Qwen3.5-9B qwen-qwen3.5-9b | Self-reported | 2026-05-06 |
| 20 | Qwen3-Next-80B-A3B-Instruct | 0.91 | Qwen3 Next 80B A3B Instruct qwen-qwen3-next-80b-a3b-instruct | Self-reported | 2026-05-06 |
| 20 | Qwen3 VL 30B A3B Thinking | 0.91 | Qwen3 VL 30B A3B Thinking qwen-qwen3-vl-30b-a3b-thinking | Self-reported | 2026-05-06 |
| 22 | Qwen3 VL 32B Instruct | 0.90 | Qwen3 VL 32B Instruct qwen-qwen3-vl-32b-instruct | Self-reported | 2026-05-06 |
| 23 | LongCat-Flash-Thinking | 0.89 | — | Self-reported | 2026-05-06 |
| 24 | DeepSeek-V3 | 0.89 | DeepSeek V3 deepseek-deepseek-chat | Self-reported | 2026-05-06 |
| 25 | Qwen3 VL 8B Thinking | 0.89 | Qwen3 VL 8B Thinking qwen-qwen3-vl-8b-thinking | Self-reported | 2026-05-06 |
| 25 | Qwen3.5-4B | 0.89 | — | Self-reported | 2026-05-06 |
| 27 | Qwen3 VL 30B A3B Instruct | 0.88 | Qwen3 VL 30B A3B Instruct qwen-qwen3-vl-30b-a3b-instruct | Self-reported | 2026-05-06 |
| 28 | Qwen3 235B A22B | 0.87 | Qwen3 235B A22B qwen-qwen3-235b-a22b | Self-reported | 2026-05-06 |
| 29 | Qwen2.5 72B Instruct | 0.87 | Qwen2.5 72B Instruct qwen-qwen-2.5-72b-instruct | Self-reported | 2026-05-06 |
| 30 | Qwen3 VL 4B Thinking | 0.86 | — | Self-reported | 2026-05-06 |
| 31 | Qwen3 VL 8B Instruct | 0.85 | Qwen3 VL 8B Instruct qwen-qwen3-vl-8b-instruct | Self-reported | 2026-05-06 |
| 32 | Qwen2.5 32B Instruct | 0.84 | — | Self-reported | 2026-05-06 |
| 33 | Ministral 3 (14B Base 2512) | 0.82 | — | Self-reported | 2026-05-06 |
| 33 | Mistral Large 3 | 0.82 | — | Self-reported | 2026-05-06 |
| 35 | Qwen3 VL 4B Instruct | 0.81 | — | Self-reported | 2026-05-06 |
| 36 | Qwen2.5 14B Instruct | 0.80 | — | Self-reported | 2026-05-06 |
| 37 | Qwen3.5-2B | 0.80 | — | Self-reported | 2026-05-06 |
| 38 | Ministral 3 (8B Base 2512) | 0.79 | — | Self-reported | 2026-05-06 |
| 39 | Qwen2.5-Coder 32B Instruct | 0.78 | Qwen2.5 Coder 32B Instruct qwen-qwen-2.5-coder-32b-instruct | Self-reported | 2026-05-06 |
| 40 | Qwen2.5 7B Instruct | 0.75 | Qwen2.5 7B Instruct qwen-qwen-2.5-7b-instruct | Self-reported | 2026-05-06 |
| 41 | Ministral 3 (3B Base 2512) | 0.73 | — | Self-reported | 2026-05-06 |
| 42 | Qwen2.5-Omni-7B | 0.71 | — | Self-reported | 2026-05-06 |
| 43 | Qwen2.5-Coder 7B Instruct | 0.67 | — | Self-reported | 2026-05-06 |
| 44 | Qwen3.5-0.8B | 0.59 | — | Self-reported | 2026-05-06 |
| 45 | ERNIE 4.5 | 0.43 | ERNIE 4.5 300B A47B baidu-ernie-4.5-300b-a47b | Self-reported | 2026-05-06 |
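
Because the two slices report scores in different forms (for example `95.3%` versus `0.95`), rows need to be put on one scale before they can be compared. The sketch below is a minimal illustration of that step. It assumes a score is either a percentage string or a fraction string, and the helper name `parse_score` is hypothetical. It is not the page's own definition of "Normalized Score", which is not stated here.

```python
def parse_score(raw: str) -> float:
    """Return a score as a fraction in [0, 1].

    Assumption: a trailing '%' marks a percentage; anything else
    is already a fraction, as in the 2026-05-06 slice.
    """
    text = raw.strip()
    if text.endswith("%"):
        return float(text[:-1]) / 100.0
    return float(text)


# A few rows copied from the table above: (subject, score as shown).
rows = [
    ("Kimi K2.6 Thinking", "95.3%"),
    ("Qwen3.7 Max", "95%"),
    ("Qwen3.5-397B-A17B", "0.95"),
]

# Map each subject to a score on the common 0-1 scale.
normalized = {name: parse_score(score) for name, score in rows}
```

The fractional scores are shown rounded to two decimals, so a cross-slice comparison after this conversion is only approximate.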