FINAL Bench Metacognitive
Functional metacognitive reasoning benchmark evaluating whether language models can identify uncertainty, detect inconsistencies, recover from errors, and correct their own reasoning.
9 rows
Primary metric: metacognitive_score
Sampled: 2026-05-06
Metadata
Metrics
Metacognitive Score, Baseline Score, Metacognitive Delta, Metacognitive Problem Quality, Baseline Problem Quality, Metacognitive Metacognitive Awareness, Baseline Metacognitive Awareness, Metacognitive Error Recovery, Baseline Error Recovery, Metacognitive Inconsistency Detection, Baseline Inconsistency Detection, Metacognitive Functional Correction, Baseline Functional Correction
| Rank | Subject | Metacognitive Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Kimi K2.5 | 78.54 | MoonshotAI: Kimi K2.5 (`moonshotai-kimi-k2.5`) | Imported | 2026-05-06 |
| 2 | Gemini 3 Pro | 77.08 | Gemini 3 (`google-gemini-3`) | Imported | 2026-05-06 |
| 3 | GPT-5.2 | 76.50 | GPT-5.2 (`openai-gpt-5.2`) | Imported | 2026-05-06 |
| 4 | GLM-5 | 76.38 | GLM 5 (`z-ai-glm-5`) | Imported | 2026-05-06 |
| 5 | Claude Opus 4.6 | 76.17 | Claude Opus 4.6 (`anthropic-claude-opus-4.6`) | Imported | 2026-05-06 |
| 6 | MiniMax-M1-2.5 | 74.04 | — | Imported | 2026-05-06 |
| 7 | GPT-OSS-120B | 73.33 | gpt-oss-120b (`openai-gpt-oss-120b`) | Imported | 2026-05-06 |
| 8 | DeepSeek-V3.2 | 73.08 | DeepSeek V3.2 (`deepseek-deepseek-v3.2`) | Imported | 2026-05-06 |
| 9 | GLM-4.7P | 71.42 | — | Imported | 2026-05-06 |
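
The ranking in the table can be reproduced directly from the Metacognitive Score column. The sketch below is illustrative only: the scores are copied from the table, while the `rank_with_gap` helper and the "gap to leader" figure are assumptions added here and are not metrics reported by the benchmark.

```python
# Metacognitive Score values copied from the leaderboard table.
SCORES = {
    "Kimi K2.5": 78.54,
    "Gemini 3 Pro": 77.08,
    "GPT-5.2": 76.50,
    "GLM-5": 76.38,
    "Claude Opus 4.6": 76.17,
    "MiniMax-M1-2.5": 74.04,
    "GPT-OSS-120B": 73.33,
    "DeepSeek-V3.2": 73.08,
    "GLM-4.7P": 71.42,
}


def rank_with_gap(scores):
    """Hypothetical helper: return (rank, subject, score, gap_to_leader), best first."""
    # Sort by score, highest first, matching the table's Rank column.
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    top_score = ordered[0][1]
    return [
        (rank, name, score, round(top_score - score, 2))
        for rank, (name, score) in enumerate(ordered, start=1)
    ]


rows = rank_with_gap(SCORES)
for rank, name, score, gap in rows:
    print(f"{rank:>2}  {name:<16} {score:6.2f}  gap to leader: {gap:.2f}")
```

With these nine rows, the gap between the first and last entries is 7.12 points, so small differences between neighbouring ranks (for example ranks 3 to 5) should be read with caution.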