FINAL Bench Metacognitive

Functional metacognitive reasoning benchmark evaluating whether language models can identify uncertainty, detect inconsistencies, recover from errors, and correct their own reasoning.

9 rows
Primary metric: metacognitive_score
Sampled: 2026-05-06

Metadata

Metrics

Metacognitive Score, Baseline Score, Metacognitive Delta, Metacognitive Problem Quality, Baseline Problem Quality, Metacognitive Metacognitive Awareness, Baseline Metacognitive Awareness, Metacognitive Error Recovery, Baseline Error Recovery, Metacognitive Inconsistency Detection, Baseline Inconsistency Detection, Metacognitive Functional Correction, Baseline Functional Correction

Latest Results

Rows are parsed from the public FINAL Bench leaderboard static model array. Component scores from the source are converted to percentages; headline baseline and metacognitive scores are preserved as displayed.
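
The conversion described above might look roughly like the following sketch. It assumes, without confirmation from the source, that component scores arrive as fractions between 0 and 1. The `normalize_row` helper and the keys `baseline_score`, `components`, and `subject` are hypothetical names chosen for illustration, and the sample values are placeholders rather than leaderboard data.

```python
from typing import Any

# Hypothetical key names; the real leaderboard array may use different ones.
HEADLINE_KEYS = ("metacognitive_score", "baseline_score")


def normalize_row(row: dict[str, Any]) -> dict[str, Any]:
    """Convert component scores to percentages; keep headline scores unchanged.

    Assumption: component scores are fractions in the range [0, 1].
    """
    # Headline scores are copied through exactly as displayed.
    out: dict[str, Any] = {key: row[key] for key in HEADLINE_KEYS if key in row}
    out["subject"] = row.get("subject")
    # Component scores are scaled to percentages.
    for name, value in row.get("components", {}).items():
        out[name] = round(float(value) * 100, 2)
    return out


# Placeholder values for illustration only, not taken from the leaderboard.
example = {
    "subject": "example-model",
    "metacognitive_score": 70.0,
    "baseline_score": 60.0,
    "components": {"error_recovery": 0.5, "inconsistency_detection": 0.25},
}
normalized = normalize_row(example)
```

Keeping the headline scores out of the scaling loop mirrors the stated rule that only component scores are converted.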

Rank  Subject          Metacognitive Score  Model Match                                        Provenance  Sampled
1     Kimi K2.5        78.54                KIMI MoonshotAI: Kimi K2.5 (moonshotai-kimi-k2.5)  Imported    2026-05-06
2     Gemini 3 Pro     77.08                Gemini 3 (google-gemini-3)                         Imported    2026-05-06
3     GPT-5.2          76.50                GPT-5.2 (openai-gpt-5.2)                           Imported    2026-05-06
4     GLM-5            76.38                GLM GLM 5 (z-ai-glm-5)                             Imported    2026-05-06
5     Claude Opus 4.6  76.17                Claude Opus 4.6 (anthropic-claude-opus-4.6)        Imported    2026-05-06
6     MiniMax-M1-2.5   74.04                (none)                                             Imported    2026-05-06
7     GPT-OSS-120B     73.33                gpt-oss-120b (openai-gpt-oss-120b)                 Imported    2026-05-06
8     DeepSeek-V3.2    73.08                DeepSeek V3.2 (deepseek-deepseek-v3.2)             Imported    2026-05-06
9     GLM-4.7P         71.42                (none)                                             Imported    2026-05-06
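
As a quick sanity check on the rows above, the following sketch confirms that the listed ranks follow descending metacognitive score and picks out the subjects that have no model match. The tuple layout and variable names are illustrative choices, not part of the leaderboard's own format. The values are copied from the table.

```python
# Rows copied from the results table:
# (rank, subject, metacognitive_score, model_match_slug or None)
rows = [
    (1, "Kimi K2.5", 78.54, "moonshotai-kimi-k2.5"),
    (2, "Gemini 3 Pro", 77.08, "google-gemini-3"),
    (3, "GPT-5.2", 76.50, "openai-gpt-5.2"),
    (4, "GLM-5", 76.38, "z-ai-glm-5"),
    (5, "Claude Opus 4.6", 76.17, "anthropic-claude-opus-4.6"),
    (6, "MiniMax-M1-2.5", 74.04, None),
    (7, "GPT-OSS-120B", 73.33, "openai-gpt-oss-120b"),
    (8, "DeepSeek-V3.2", 73.08, "deepseek-deepseek-v3.2"),
    (9, "GLM-4.7P", 71.42, None),
]

# Sort by score (descending) and compare against the listed ranks.
by_score = sorted(rows, key=lambda r: r[2], reverse=True)
ranks_consistent = [r[0] for r in by_score] == list(range(1, len(rows) + 1))

# Subjects whose Model Match cell is empty in the table.
unmatched = [r[1] for r in rows if r[3] is None]

# Gap between the top and bottom metacognitive scores.
spread = round(rows[0][2] - rows[-1][2], 2)

print(ranks_consistent, unmatched, spread)
```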