MultiChallenge
MultiChallenge evaluates frontier LLMs on realistic multi-turn conversations, assessing instruction retention, inference memory, and self-coherence.
Rows: 29
Primary metric: score
Sampled: 2026-05-06
Metadata
Metrics: Score, Confidence Interval Upper, Max Score
| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Muse Spark | 75.52 | — | Imported | 2026-05-06 |
| 1 | gemini-3.1-pro-preview | 71.37 | Gemini 3.1 Pro Preview (`google-gemini-3.1-pro-preview`) | Imported | 2026-05-06 |
| 1 | gpt-5.4-pro-2026-03-05 | 69.23 | GPT-5.4 Pro (`openai-gpt-5.4-pro`) | Imported | 2026-05-06 |
| 3 | gemini-3-pro-preview | 65.67 | Gemini 3 (`google-gemini-3`) | Imported | 2026-05-06 |
| 4 | gpt-5.1-2025-11-13-thinking | 63.41 | GPT-5.1 (`openai-gpt-5.1`) | Imported | 2026-05-06 |
| 4 | gpt-5-thinking | 63.19 | GPT-5 (`openai-gpt-5`) | Imported | 2026-05-06 |
| 4 | o3-pro-2025-06-10-reasoning-high | 62.40 | — | Imported | 2026-05-06 |
| 5 | kimi-k2.5 | 61.39 | MoonshotAI: Kimi K2.5 (`moonshotai-kimi-k2.5`) | Imported | 2026-05-06 |
| 5 | gpt-5-mini-thinking | 58.99 | GPT-5 Mini (`openai-gpt-5-mini`) | Imported | 2026-05-06 |
| 7 | gemini-3.1-flash-lite-preview | 60.61 | Gemini 3.1 Flash Lite Preview (`google-gemini-3.1-flash-lite-preview`) | Imported | 2026-05-06 |
| 9 | o3-2025-04-16-reasoning-high | 56.62 | — | Imported | 2026-05-06 |
| 10 | claude-opus-4-5-20251101-thinking | 58.97 | Claude Opus 4.5 (`anthropic-claude-opus-4.5`) | Imported | 2026-05-06 |
| 10 | claude-4-opus-thinking | 58.62 | — | Imported | 2026-05-06 |
| 11 | claude-opus-4-1-20250805-thinking | 57.20 | Claude Opus 4.1 (`anthropic-claude-opus-4.1`) | Imported | 2026-05-06 |
| 11 | claude-4-sonnet-thinking | 57.11 | — | Imported | 2026-05-06 |
| 12 | claude-opus-4-6 (Non-Thinking) | 56.02 | Claude Opus 4.6 (`anthropic-claude-opus-4.6`) | Imported | 2026-05-06 |
| 12 | kimi-k2-thinking | 55.42 | MoonshotAI: Kimi K2 Thinking (`moonshotai-kimi-k2-thinking`) | Imported | 2026-05-06 |
| 12 | claude-sonnet-4-5-20250929-thinking | 55.32 | Claude Sonnet 4.5 (`anthropic-claude-sonnet-4.5`) | Imported | 2026-05-06 |
| 15 | gemini-2-5-pro | 53.62 | Gemini 2.5 Pro (`google-gemini-2.5-pro`) | Imported | 2026-05-06 |
| 16 | claude-3-7-sonnet-thinking | 51.58 | Claude 3.7 Sonnet (thinking) (`anthropic-claude-3.7-sonnet-thinking`) | Imported | 2026-05-06 |
| 19 | gpt-5.1-2025-11-13-instant | 51.23 | GPT-5.1 (`openai-gpt-5.1`) | Imported | 2026-05-06 |
| 19 | claude-haiku-4-5-20251001-thinking | 50.49 | Claude Haiku 4.5 (`anthropic-claude-haiku-4.5`) | Imported | 2026-05-06 |
| 24 | deepseek-v3p1 | 46.10 | DeepSeek V3.1 Terminus (`deepseek-deepseek-v3.1-terminus`) | Imported | 2026-05-06 |
| 24 | gpt-oss-120b | 45.34 | gpt-oss-120b (`openai-gpt-oss-120b`) | Imported | 2026-05-06 |
| 24 | o4-mini-2025-04-16-reasoning-high | 44.90 | — | Imported | 2026-05-06 |
| 27 | qwen3-235b-a22b | 41.22 | Qwen3 235B A22B (`qwen-qwen3-235b-a22b`) | Imported | 2026-05-06 |
| 27 | gpt-4.1 | 39.43 | GPT-4.1 (`openai-gpt-4.1`) | Imported | 2026-05-06 |
| 28 | claude-opus-4-6-thinking-max | 37.15 | Claude Opus 4.6 (`anthropic-claude-opus-4.6`) | Imported | 2026-05-06 |
| 29 | gemini-2-0-flash | 36.35 | Gemini 2.0 Flash (`google-gemini-2.0-flash`) | Imported | 2026-05-06 |
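
The Rank column in the table above is not a strict ordering by Score (for example, the rank 7 row scores higher than one of the rank 5 rows), and this page does not state how Rank is derived. For anyone who needs a plain score ordering, the following is a minimal Python sketch that parses pipe-delimited rows and re-sorts them by score. The `Row` type, the `parse_rows` helper, and the trimmed three-column sample are illustrative assumptions, not part of the leaderboard or any official tooling.

```python
from typing import NamedTuple


class Row(NamedTuple):
    """Hypothetical record holding the first three table columns."""
    rank: int
    subject: str
    score: float


# Rank, Subject, and Score values for three rows of the table above;
# the remaining rows and columns are left out to keep the sketch short.
SAMPLE = """\
| Rank | Subject | Score |
|---|---|---|
| 5 | gpt-5-mini-thinking | 58.99 |
| 7 | gemini-3.1-flash-lite-preview | 60.61 |
| 10 | claude-opus-4-5-20251101-thinking | 58.97 |
"""


def parse_rows(table: str) -> list[Row]:
    """Parse data rows of a pipe-delimited table, skipping header and separator lines."""
    rows = []
    for line in table.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        # Data rows start with an integer rank; anything else is header or separator.
        if len(cells) < 3 or not cells[0].isdigit():
            continue
        rows.append(Row(int(cells[0]), cells[1], float(cells[2])))
    return rows


rows = parse_rows(SAMPLE)

# Re-order by score, highest first, ignoring the published Rank column.
by_score = sorted(rows, key=lambda r: r.score, reverse=True)

for row in by_score:
    print(f"{row.score:6.2f}  rank {row.rank:>2}  {row.subject}")
```

Sorting on `score` alone discards whatever uncertainty information the published rank may encode, so the two orderings should be read as answering different questions.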