MultiChallenge

MultiChallenge evaluates frontier LLMs on realistic multi-turn conversations, assessing instruction retention, inference memory, and self-coherence.

Rows: 29
Primary metric: score
Sampled: 2026-05-06

Metadata

Metrics

Score, Confidence Interval Upper, Max Score

Latest Results

| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Muse Spark | 75.52 | | Imported | 2026-05-06 |
| 1 | gemini-3.1-pro-preview | 71.37 | Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview) | Imported | 2026-05-06 |
| 1 | gpt-5.4-pro-2026-03-05 | 69.23 | GPT-5.4 Pro (openai-gpt-5.4-pro) | Imported | 2026-05-06 |
| 3 | gemini-3-pro-preview | 65.67 | Gemini 3 (google-gemini-3) | Imported | 2026-05-06 |
| 4 | gpt-5.1-2025-11-13-thinking | 63.41 | GPT-5.1 (openai-gpt-5.1) | Imported | 2026-05-06 |
| 4 | gpt-5-thinking | 63.19 | GPT-5 (openai-gpt-5) | Imported | 2026-05-06 |
| 4 | o3-pro-2025-06-10-reasoning-high | 62.40 | | Imported | 2026-05-06 |
| 5 | kimi-k2.5 | 61.39 | KIMI MoonshotAI: Kimi K2.5 (moonshotai-kimi-k2.5) | Imported | 2026-05-06 |
| 5 | gpt-5-mini-thinking | 58.99 | GPT-5 Mini (openai-gpt-5-mini) | Imported | 2026-05-06 |
| 7 | gemini-3.1-flash-lite-preview | 60.61 | Gemini 3.1 Flash Lite Preview (google-gemini-3.1-flash-lite-preview) | Imported | 2026-05-06 |
| 9 | o3-2025-04-16-reasoning-high | 56.62 | | Imported | 2026-05-06 |
| 10 | claude-opus-4-5-20251101-thinking | 58.97 | Claude Opus 4.5 (anthropic-claude-opus-4.5) | Imported | 2026-05-06 |
| 10 | claude-4-opus-thinking | 58.62 | | Imported | 2026-05-06 |
| 11 | claude-opus-4-1-20250805-thinking | 57.20 | Claude Opus 4.1 (anthropic-claude-opus-4.1) | Imported | 2026-05-06 |
| 11 | claude-4-sonnet-thinking | 57.11 | | Imported | 2026-05-06 |
| 12 | claude-opus-4-6 (Non-Thinking) | 56.02 | Claude Opus 4.6 (anthropic-claude-opus-4.6) | Imported | 2026-05-06 |
| 12 | kimi-k2-thinking | 55.42 | KIMI MoonshotAI: Kimi K2 Thinking (moonshotai-kimi-k2-thinking) | Imported | 2026-05-06 |
| 12 | claude-sonnet-4-5-20250929-thinking | 55.32 | Claude Sonnet 4.5 (anthropic-claude-sonnet-4.5) | Imported | 2026-05-06 |
| 15 | gemini-2-5-pro | 53.62 | Gemini 2.5 Pro (google-gemini-2.5-pro) | Imported | 2026-05-06 |
| 16 | claude-3-7-sonnet-thinking | 51.58 | Claude 3.7 Sonnet (thinking) (anthropic-claude-3.7-sonnet-thinking) | Imported | 2026-05-06 |
| 19 | gpt-5.1-2025-11-13-instant | 51.23 | GPT-5.1 (openai-gpt-5.1) | Imported | 2026-05-06 |
| 19 | claude-haiku-4-5-20251001-thinking | 50.49 | Claude Haiku 4.5 (anthropic-claude-haiku-4.5) | Imported | 2026-05-06 |
| 24 | deepseek-v3p1 | 46.10 | DeepSeek V3.1 Terminus (deepseek-deepseek-v3.1-terminus) | Imported | 2026-05-06 |
| 24 | gpt-oss-120b | 45.34 | gpt-oss-120b (openai-gpt-oss-120b) | Imported | 2026-05-06 |
| 24 | o4-mini-2025-04-16-reasoning-high | 44.90 | | Imported | 2026-05-06 |
| 27 | qwen3-235b-a22b | 41.22 | Qwen3 235B A22B (qwen-qwen3-235b-a22b) | Imported | 2026-05-06 |
| 27 | gpt-4.1 | 39.43 | GPT-4.1 (openai-gpt-4.1) | Imported | 2026-05-06 |
| 28 | claude-opus-4-6-thinking-max | 37.15 | Claude Opus 4.6 (anthropic-claude-opus-4.6) | Imported | 2026-05-06 |
| 29 | gemini-2-0-flash | 36.35 | Gemini 2.0 Flash (google-gemini-2.0-flash) | Imported | 2026-05-06 |