MMLU-Redux

An improved version of the MMLU benchmark featuring manually re-annotated questions to identify and correct errors in the original dataset. Provides more reliable evaluation metrics for language models by addressing dataset quality issues found in the original MMLU.

51 rows
Primary metric: score
Sampled: 2026-05-28

Metadata

Metrics

Score, Normalized Score

Showing 2 latest source slices.

Latest Results

Provider-published Qwen3.7-Max comparison scores. Rows are marked self-reported and should be interpreted as source claims unless independently reproduced.

| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Kimi K2.6 Thinking | 95.3% | MoonshotAI: Kimi K2.6 (moonshotai-kimi-k2.6) | Self-reported | 2026-05-28 |
| 2 | Claude Opus 4.6 Max | 95.2% | Claude Opus 4.6 (anthropic-claude-opus-4.6) | Self-reported | 2026-05-28 |
| 3 | Qwen3.7 Max | 95% | Qwen3.7 Max (qwen-qwen3.7-max) | Self-reported | 2026-05-28 |
| 4 | DeepSeek V4 Pro Max | 94.8% | DeepSeek V4 Pro (deepseek-deepseek-v4-pro) | Self-reported | 2026-05-28 |
| 5 | Qwen3.6 Plus | 94.5% | Qwen3.6 Plus (qwen-qwen3.6-plus) | Self-reported | 2026-05-28 |
| 6 | GLM-5.1 Thinking | 94.3% | GLM 5.1 (z-ai-glm-5.1) | Self-reported | 2026-05-28 |

| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Qwen3.5-397B-A17B | 0.95 | Qwen3.5 397B A17B (qwen-qwen3.5-397b-a17b) | Self-reported | 2026-05-06 |
| 2 | Qwen3.6 Plus | 0.94 | Qwen3.6 Plus (qwen-qwen3.6-plus) | Self-reported | 2026-05-06 |
| 3 | Kimi K2-Thinking-0905 | 0.94 | MoonshotAI: Kimi K2 Thinking (moonshotai-kimi-k2-thinking) | Self-reported | 2026-05-06 |
| 4 | Qwen3.5-122B-A10B | 0.94 | Qwen3.5-122B-A10B (qwen-qwen3.5-122b-a10b) | Self-reported | 2026-05-06 |
| 5 | Qwen3-235B-A22B-Thinking-2507 | 0.94 | Qwen3 235B A22B Thinking 2507 (qwen-qwen3-235b-a22b-thinking-2507) | Self-reported | 2026-05-06 |
| 6 | Qwen3 VL 235B A22B Thinking | 0.94 | Qwen3 VL 235B A22B Thinking (qwen-qwen3-vl-235b-a22b-thinking) | Self-reported | 2026-05-06 |
| 7 | Qwen3.6-27B | 0.94 | Qwen3.6 27B (qwen-qwen3.6-27b) | Self-reported | 2026-05-06 |
| 8 | DeepSeek-R1-0528 | 0.93 | R1 0528 (deepseek-deepseek-r1-0528) | Self-reported | 2026-05-06 |
| 9 | Qwen3.6-35B-A3B | 0.93 | Qwen3.6 35B A3B (qwen-qwen3.6-35b-a3b) | Self-reported | 2026-05-06 |
| 9 | Qwen3.5-35B-A3B | 0.93 | Qwen3.5-35B-A3B (qwen-qwen3.5-35b-a3b) | Self-reported | 2026-05-06 |
| 11 | Qwen3.5-27B | 0.93 | Qwen3.5-27B (qwen-qwen3.5-27b) | Self-reported | 2026-05-06 |
| 12 | Qwen3-235B-A22B-Instruct-2507 | 0.93 | Qwen3 235B A22B Instruct 2507 (qwen-qwen3-235b-a22b-2507) | Self-reported | 2026-05-06 |
| 13 | Kimi K2-Instruct-0905 | 0.93 | MoonshotAI: Kimi K2 0905 (moonshotai-kimi-k2-0905) | Self-reported | 2026-05-06 |
| 13 | Kimi K2 Instruct | 0.93 | MoonshotAI: Kimi K2 0711 (moonshotai-kimi-k2) | Self-reported | 2026-05-06 |
| 15 | Qwen3-Next-80B-A3B-Thinking | 0.93 | Qwen3 Next 80B A3B Thinking (qwen-qwen3-next-80b-a3b-thinking) | Self-reported | 2026-05-06 |
| 16 | Qwen3 VL 235B A22B Instruct | 0.92 | Qwen3 VL 235B A22B Instruct (qwen-qwen3-vl-235b-a22b-instruct) | Self-reported | 2026-05-06 |
| 17 | Qwen3 VL 32B Thinking | 0.92 | | Self-reported | 2026-05-06 |
| 18 | DeepSeek-V3.1 | 0.92 | DeepSeek V3.1 (deepseek-deepseek-chat-v3.1) | Self-reported | 2026-05-06 |
| 19 | Qwen3.5-9B | 0.91 | Qwen3.5-9B (qwen-qwen3.5-9b) | Self-reported | 2026-05-06 |
| 20 | Qwen3-Next-80B-A3B-Instruct | 0.91 | Qwen3 Next 80B A3B Instruct (qwen-qwen3-next-80b-a3b-instruct) | Self-reported | 2026-05-06 |
| 20 | Qwen3 VL 30B A3B Thinking | 0.91 | Qwen3 VL 30B A3B Thinking (qwen-qwen3-vl-30b-a3b-thinking) | Self-reported | 2026-05-06 |
| 22 | Qwen3 VL 32B Instruct | 0.90 | Qwen3 VL 32B Instruct (qwen-qwen3-vl-32b-instruct) | Self-reported | 2026-05-06 |
| 23 | LongCat-Flash-Thinking | 0.89 | | Self-reported | 2026-05-06 |
| 24 | DeepSeek-V3 | 0.89 | DeepSeek V3 (deepseek-deepseek-chat) | Self-reported | 2026-05-06 |
| 25 | Qwen3 VL 8B Thinking | 0.89 | Qwen3 VL 8B Thinking (qwen-qwen3-vl-8b-thinking) | Self-reported | 2026-05-06 |
| 25 | Qwen3.5-4B | 0.89 | | Self-reported | 2026-05-06 |
| 27 | Qwen3 VL 30B A3B Instruct | 0.88 | Qwen3 VL 30B A3B Instruct (qwen-qwen3-vl-30b-a3b-instruct) | Self-reported | 2026-05-06 |
| 28 | Qwen3 235B A22B | 0.87 | Qwen3 235B A22B (qwen-qwen3-235b-a22b) | Self-reported | 2026-05-06 |
| 29 | Qwen2.5 72B Instruct | 0.87 | Qwen2.5 72B Instruct (qwen-qwen-2.5-72b-instruct) | Self-reported | 2026-05-06 |
| 30 | Qwen3 VL 4B Thinking | 0.86 | | Self-reported | 2026-05-06 |
| 31 | Qwen3 VL 8B Instruct | 0.85 | Qwen3 VL 8B Instruct (qwen-qwen3-vl-8b-instruct) | Self-reported | 2026-05-06 |
| 32 | Qwen2.5 32B Instruct | 0.84 | | Self-reported | 2026-05-06 |
| 33 | Ministral 3 (14B Base 2512) | 0.82 | | Self-reported | 2026-05-06 |
| 33 | Mistral Large 3 | 0.82 | | Self-reported | 2026-05-06 |
| 35 | Qwen3 VL 4B Instruct | 0.81 | | Self-reported | 2026-05-06 |
| 36 | Qwen2.5 14B Instruct | 0.80 | | Self-reported | 2026-05-06 |
| 37 | Qwen3.5-2B | 0.80 | | Self-reported | 2026-05-06 |
| 38 | Ministral 3 (8B Base 2512) | 0.79 | | Self-reported | 2026-05-06 |
| 39 | Qwen2.5-Coder 32B Instruct | 0.78 | Qwen2.5 Coder 32B Instruct (qwen-qwen-2.5-coder-32b-instruct) | Self-reported | 2026-05-06 |
| 40 | Qwen2.5 7B Instruct | 0.75 | Qwen2.5 7B Instruct (qwen-qwen-2.5-7b-instruct) | Self-reported | 2026-05-06 |
| 41 | Ministral 3 (3B Base 2512) | 0.73 | | Self-reported | 2026-05-06 |
| 42 | Qwen2.5-Omni-7B | 0.71 | | Self-reported | 2026-05-06 |
| 43 | Qwen2.5-Coder 7B Instruct | 0.67 | | Self-reported | 2026-05-06 |
| 44 | Qwen3.5-0.8B | 0.59 | | Self-reported | 2026-05-06 |
| 45 | ERNIE 4.5 | 0.43 | ERNIE 4.5 300B A47B (baidu-ernie-4.5-300b-a47b) | Self-reported | 2026-05-06 |
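
The two result slices above report scores on different scales: percentages such as 95.3% for the 2026-05-28 rows and fractions such as 0.95 for the 2026-05-06 rows. The following is a minimal sketch of one way to put both on a common 0-1 scale before comparing them. The helper name `to_fraction` and the small `rows` sample are illustrative assumptions, and this is not necessarily how the page derives its Normalized Score metric.

```python
def to_fraction(score: str) -> float:
    """Convert a displayed score such as '95.3%' or '0.95' to a 0-1 fraction."""
    text = score.strip()
    if text.endswith("%"):
        # Percent-style values (2026-05-28 slice) are divided by 100.
        return float(text[:-1]) / 100.0
    # Fraction-style values (2026-05-06 slice) are already on a 0-1 scale.
    return float(text)


# A few rows copied from the results above: (subject, displayed score, sampled date).
rows = [
    ("Kimi K2.6 Thinking", "95.3%", "2026-05-28"),
    ("Qwen3.7 Max", "95%", "2026-05-28"),
    ("Qwen3.5-397B-A17B", "0.95", "2026-05-06"),
    ("ERNIE 4.5", "0.43", "2026-05-06"),
]

# Round to 4 places to avoid floating-point noise from the division.
normalized = [
    (subject, round(to_fraction(score), 4), sampled)
    for subject, score, sampled in rows
]

for subject, value, sampled in normalized:
    print(f"{subject:<22} {value:.3f}  (sampled {sampled})")
```

The displayed values are rounded and all rows are self-reported, so a comparison across the two sampling dates built this way is only approximate.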