MMLU (CoT)

Chain-of-Thought variant of the Massive Multitask Language Understanding benchmark, evaluating language models across 57 tasks including elementary mathematics, US history, computer science, law, and other professional and academic subjects. This version uses chain-of-thought prompting to elicit step-by-step reasoning.
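As a rough illustration of what chain-of-thought evaluation on a multiple-choice benchmark can look like, the sketch below builds a step-by-step prompt, pulls the final letter out of a model completion, and computes accuracy. The prompt wording, the "Answer: <letter>" convention, and the helper names (build_cot_prompt, extract_answer, accuracy) are assumptions made for illustration; this page does not state the template or scoring script behind the reported results.

```python
import re
from typing import List, Optional

CHOICES = "ABCD"  # MMLU questions are four-option multiple choice


def build_cot_prompt(question: str, options: List[str]) -> str:
    """Format one question with lettered options and a step-by-step cue (hypothetical template)."""
    lines = [question]
    lines += [f"{letter}. {text}" for letter, text in zip(CHOICES, options)]
    lines.append("Let's think step by step, then finish with 'Answer: <letter>'.")
    return "\n".join(lines)


def extract_answer(completion: str) -> Optional[str]:
    """Return the last 'Answer: X' letter in the completion, or None if none is found."""
    matches = re.findall(r"Answer:\s*\(?([ABCD])\)?", completion)
    return matches[-1] if matches else None


def accuracy(completions: List[str], gold: List[str]) -> float:
    """Fraction of completions whose extracted letter equals the gold letter."""
    correct = sum(extract_answer(c) == g for c, g in zip(completions, gold))
    return correct / len(gold)
```

Taking the last match rather than the first is a deliberate choice in this sketch: reasoning text may mention candidate letters before the model commits to a final answer. A completion with no parsable answer is counted as incorrect.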

Rows: 3
Primary metric: score
Sampled: 2026-05-06

Metadata

Metrics: Score, Normalized Score

Latest Results

Rank | Subject | Score | Model Match | Provenance | Sampled
1 | Llama 3.1 405B Instruct | 0.89 | | Self-reported | 2026-05-06
2 | Llama 3.1 70B Instruct | 0.86 | Llama 3.1 70B Instruct (meta-llama-llama-3.1-70b-instruct) | Self-reported | 2026-05-06
3 | Llama 3.1 8B Instruct | 0.73 | Llama 3.1 8B Instruct (meta-llama-llama-3.1-8b-instruct) | Self-reported | 2026-05-06