Clembench Text v3.0

Clembench evaluates chat-optimized language models as conversational agents through language games; this v3.0 text leaderboard tracks Clemscore, played percentage, quality score, and task-level game metrics.

Rows: 31
Primary metric: clemscore
Sampled: 2026-05-06

Metadata

Metrics

clemscore, adventuregame % Played, adventuregame Quality Score, all Average % Played, all Average Quality Score, clean_up % Played, clean_up Quality Score, codenames % Played, codenames Quality Score, dond % Played, dond Quality Score, guesswhat % Played, guesswhat Quality Score, hot_air_balloon % Played, hot_air_balloon Quality Score, imagegame % Played, imagegame Quality Score, matchit_ascii % Played, matchit_ascii Quality Score, privateshared % Played, privateshared Quality Score, referencegame % Played, referencegame Quality Score, taboo % Played, taboo Quality Score, textmapworld % Played, textmapworld Quality Score, textmapworld_graphreasoning % Played, textmapworld_graphreasoning Quality Score, textmapworld_specificroom % Played, textmapworld_specificroom Quality Score, wordle % Played, wordle Quality Score, wordle_withclue % Played, wordle_withclue Quality Score, wordle_withcritic % Played, wordle_withcritic Quality Score
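
The metrics above include "all Average % Played" and "all Average Quality Score" alongside clemscore. This page does not state how the three relate. As a rough sketch, assuming clemscore is the average quality score scaled by the fraction of games played (our reading of the Clembench publications, not something confirmed here), the relationship could look like the following. The function names, the None-skipping rule, and the example inputs are hypothetical.

```python
from statistics import mean
from typing import Optional, Sequence


def macro_average(values: Sequence[Optional[float]]) -> float:
    """Average per-game values, skipping games with no value (None).

    Assumption: a game with no quality score is left out of the average
    rather than counted as zero.
    """
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("no values to average")
    return mean(present)


def clemscore(avg_played_pct: float, avg_quality: float) -> float:
    """Assumed definition: quality score scaled by the fraction played.

    Both inputs are expected on a 0-100 scale.
    """
    if not 0.0 <= avg_played_pct <= 100.0:
        raise ValueError("avg_played_pct must be between 0 and 100")
    if not 0.0 <= avg_quality <= 100.0:
        raise ValueError("avg_quality must be between 0 and 100")
    return (avg_played_pct / 100.0) * avg_quality


if __name__ == "__main__":
    # Hypothetical per-game values, not taken from the leaderboard.
    played = macro_average([100.0, 80.0, 60.0])
    quality = macro_average([70.0, None, 50.0])
    print(round(clemscore(played, quality), 2))
```

Under this assumed definition, a model that plays few games to completion is penalized even when the games it does finish score well.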

Latest Results

Rank | Subject | clemscore | Model Match | Provenance | Sampled
1 | claude-sonnet-4-5-azure-high-t1.0 | 90.10 | - | Imported | 2026-05-06
2 | claude-sonnet-4-5-20250929-t1.0 | 87.42 | - | Imported | 2026-05-06
3 | claude-sonnet-4-5-azure-low-t1.0 | 86.01 | - | Imported | 2026-05-06
4 | gpt-5.2-azure-high-t1.0 | 84.19 | GPT-5.2 (openai-gpt-5.2) | Imported | 2026-05-06
5 | gemini-3-flash-t1.0 | 84.03 | - | Imported | 2026-05-06
6 | gpt-5.2-2025-12-11-t1.0 | 81.66 | GPT-5.2 (openai-gpt-5.2) | Imported | 2026-05-06
7 | gpt-5.2-azure-medium-t1.0 | 79.61 | GPT-5.2 (openai-gpt-5.2) | Imported | 2026-05-06
8 | glm-4.7-t1.0 | 78.05 | - | Imported | 2026-05-06
9 | kimi-k2-thinking-t1.0 | 77.79 | KIMI MoonshotAI: Kimi K2 Thinking (moonshotai-kimi-k2-thinking) | Imported | 2026-05-06
10 | gpt-5.2-azure-minimal-t1.0 | 74.27 | GPT-5.2 (openai-gpt-5.2) | Imported | 2026-05-06
11 | glm-4.6-t1.0 | 63.91 | - | Imported | 2026-05-06
12 | kimi-k2.5-without-reasoning-t1.0 | 60.28 | - | Imported | 2026-05-06
13 | qwen3-max-t1.0 | 59.66 | - | Imported | 2026-05-06
14 | deepseek-v3.2-t1.0 | 59.61 | - | Imported | 2026-05-06
15 | glm-5-without-reasoning-t1.0 | 58.68 | - | Imported | 2026-05-06
16 | minimax-m2.5-t1.0 | 55.68 | - | Imported | 2026-05-06
17 | deepseek-v3.2-without-reasoning-t1.0 | 52.94 | - | Imported | 2026-05-06
18 | Llama-3.3-70B-Instruct | 50 | Llama 3.3 70B Instruct (meta-llama-llama-3.3-70b-instruct) | Imported | 2026-05-06
19 | Qwen2.5-72B-Instruct | 48.07 | Qwen2.5 72B Instruct (qwen-qwen-2.5-72b-instruct) | Imported | 2026-05-06
20 | Llama-3.1-70B-Instruct | 46.80 | Llama 3.1 70B Instruct (meta-llama-llama-3.1-70b-instruct) | Imported | 2026-05-06
21 | Qwen3-Next-80B-A3B-Instruct | 45.24 | Qwen3 Next 80B A3B Instruct (qwen-qwen3-next-80b-a3b-instruct) | Imported | 2026-05-06
22 | mistral-3-large-2512-t1.0 | 44.79 | - | Imported | 2026-05-06
23 | gpt-oss-20b-t1.0 | 41.57 | - | Imported | 2026-05-06
24 | gpt-oss-120b-t1.0 | 35.96 | - | Imported | 2026-05-06
25 | Qwen2.5-Coder-32B-Instruct | 35.32 | Qwen2.5 Coder 32B Instruct (qwen-qwen-2.5-coder-32b-instruct) | Imported | 2026-05-06
26 | Ministral-3-14B-Reasoning-2512-nothink | 26.66 | - | Imported | 2026-05-06
27 | Llama-3.1-8B-Instruct | 25.28 | Llama 3.1 8B Instruct (meta-llama-llama-3.1-8b-instruct) | Imported | 2026-05-06
28 | Aya-Expanse-32B | 16.90 | - | Imported | 2026-05-06
29 | Olmo-3.1-32B-Instruct | 14.63 | OLMO Olmo 3.1 32B Instruct (allenai-olmo-3.1-32b-instruct) | Imported | 2026-05-06
30 | EuroLLM-22B-Instruct-2512 | 13.90 | - | Imported | 2026-05-06
31 | Teuken-7B-Instruct-v0.4 | 7.02 | - | Imported | 2026-05-06
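
The Rank column follows clemscore in descending order. A minimal sketch of how such a ranking could be rebuilt and checked from (subject, clemscore) pairs is given below. The first five rows are copied from the results above; the helper names rank_rows and gaps_to_leader are hypothetical and not part of any Clembench tooling.

```python
from typing import List, Tuple

# (subject, clemscore) pairs copied from the top five leaderboard rows.
ROWS: List[Tuple[str, float]] = [
    ("gemini-3-flash-t1.0", 84.03),
    ("claude-sonnet-4-5-azure-high-t1.0", 90.10),
    ("gpt-5.2-azure-high-t1.0", 84.19),
    ("claude-sonnet-4-5-20250929-t1.0", 87.42),
    ("claude-sonnet-4-5-azure-low-t1.0", 86.01),
]


def rank_rows(rows: List[Tuple[str, float]]) -> List[Tuple[int, str, float]]:
    """Sort by clemscore (highest first) and attach 1-based ranks."""
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    return [(rank, name, score)
            for rank, (name, score) in enumerate(ordered, start=1)]


def gaps_to_leader(ranked: List[Tuple[int, str, float]]) -> List[Tuple[str, float]]:
    """Return each subject's clemscore deficit relative to rank 1."""
    leader_score = ranked[0][2]
    return [(name, round(leader_score - score, 2)) for _, name, score in ranked]


if __name__ == "__main__":
    ranked = rank_rows(ROWS)
    for rank, name, score in ranked:
        print(f"{rank:>2}  {name:<36} {score:6.2f}")
    for name, gap in gaps_to_leader(ranked):
        print(f"{name}: {gap} behind the leader")
```

Ties are not handled specially in this sketch: equal scores keep their input order because Python's sort is stable.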