Clembench Text v3.0
Clembench evaluates chat-optimized language models as conversational agents by having them play language games. This v3.0 text leaderboard reports clemscore, the percentage of games played, the quality score, and per-game metrics.
Metadata
Metrics
clemscore, adventuregame % Played, adventuregame Quality Score, all Average % Played, all Average Quality Score, clean_up % Played, clean_up Quality Score, codenames % Played, codenames Quality Score, dond % Played, dond Quality Score, guesswhat % Played, guesswhat Quality Score, hot_air_balloon % Played, hot_air_balloon Quality Score, imagegame % Played, imagegame Quality Score, matchit_ascii % Played, matchit_ascii Quality Score, privateshared % Played, privateshared Quality Score, referencegame % Played, referencegame Quality Score, taboo % Played, taboo Quality Score, textmapworld % Played, textmapworld Quality Score, textmapworld_graphreasoning % Played, textmapworld_graphreasoning Quality Score, textmapworld_specificroom % Played, textmapworld_specificroom Quality Score, wordle % Played, wordle Quality Score, wordle_withclue % Played, wordle_withclue Quality Score, wordle_withcritic % Played, wordle_withcritic Quality Score
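The headline clemscore is related to the two "all Average" metrics listed above. The sketch below assumes the usual clembench definition: the macro-averaged quality score scaled by the macro-averaged fraction of games played. The function names and the per-game numbers are hypothetical and are not taken from this leaderboard.

```python
def macro_average(per_game: dict[str, float]) -> float:
    """Unweighted mean over games, so every game counts equally."""
    if not per_game:
        raise ValueError("need at least one game")
    return sum(per_game.values()) / len(per_game)


def clemscore(avg_played_pct: float, avg_quality: float) -> float:
    """Assumed definition: quality score scaled by the fraction of games played.

    Both inputs are expected on a 0-100 scale.
    """
    for value in (avg_played_pct, avg_quality):
        if not 0.0 <= value <= 100.0:
            raise ValueError("metrics must lie in [0, 100]")
    return (avg_played_pct / 100.0) * avg_quality


# Hypothetical per-game metrics for a single model (illustration only).
played = {"taboo": 100.0, "wordle": 80.0, "codenames": 90.0}
quality = {"taboo": 70.0, "wordle": 40.0, "codenames": 60.0}

score = clemscore(macro_average(played), macro_average(quality))
print(f"clemscore = {score:.2f}")
```

Under this assumption, a model that plays few games to completion scores low even if its quality on the completed games is high, which is why the leaderboard reports both components alongside clemscore.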
| Rank | Subject | clemscore | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | claude-sonnet-4-5-azure-high-t1.0 | 90.10 | — | Imported | 2026-05-06 |
| 2 | claude-sonnet-4-5-20250929-t1.0 | 87.42 | — | Imported | 2026-05-06 |
| 3 | claude-sonnet-4-5-azure-low-t1.0 | 86.01 | — | Imported | 2026-05-06 |
| 4 | gpt-5.2-azure-high-t1.0 | 84.19 | GPT-5.2 openai-gpt-5.2 | Imported | 2026-05-06 |
| 5 | gemini-3-flash-t1.0 | 84.03 | — | Imported | 2026-05-06 |
| 6 | gpt-5.2-2025-12-11-t1.0 | 81.66 | GPT-5.2 openai-gpt-5.2 | Imported | 2026-05-06 |
| 7 | gpt-5.2-azure-medium-t1.0 | 79.61 | GPT-5.2 openai-gpt-5.2 | Imported | 2026-05-06 |
| 8 | glm-4.7-t1.0 | 78.05 | — | Imported | 2026-05-06 |
| 9 | kimi-k2-thinking-t1.0 | 77.79 | MoonshotAI: Kimi K2 Thinking moonshotai-kimi-k2-thinking | Imported | 2026-05-06 |
| 10 | gpt-5.2-azure-minimal-t1.0 | 74.27 | GPT-5.2 openai-gpt-5.2 | Imported | 2026-05-06 |
| 11 | glm-4.6-t1.0 | 63.91 | — | Imported | 2026-05-06 |
| 12 | kimi-k2.5-without-reasoning-t1.0 | 60.28 | — | Imported | 2026-05-06 |
| 13 | qwen3-max-t1.0 | 59.66 | — | Imported | 2026-05-06 |
| 14 | deepseek-v3.2-t1.0 | 59.61 | — | Imported | 2026-05-06 |
| 15 | glm-5-without-reasoning-t1.0 | 58.68 | — | Imported | 2026-05-06 |
| 16 | minimax-m2.5-t1.0 | 55.68 | — | Imported | 2026-05-06 |
| 17 | deepseek-v3.2-without-reasoning-t1.0 | 52.94 | — | Imported | 2026-05-06 |
| 18 | Llama-3.3-70B-Instruct | 50 | Llama 3.3 70B Instruct meta-llama-llama-3.3-70b-instruct | Imported | 2026-05-06 |
| 19 | Qwen2.5-72B-Instruct | 48.07 | Qwen2.5 72B Instruct qwen-qwen-2.5-72b-instruct | Imported | 2026-05-06 |
| 20 | Llama-3.1-70B-Instruct | 46.80 | Llama 3.1 70B Instruct meta-llama-llama-3.1-70b-instruct | Imported | 2026-05-06 |
| 21 | Qwen3-Next-80B-A3B-Instruct | 45.24 | Qwen3 Next 80B A3B Instruct qwen-qwen3-next-80b-a3b-instruct | Imported | 2026-05-06 |
| 22 | mistral-3-large-2512-t1.0 | 44.79 | — | Imported | 2026-05-06 |
| 23 | gpt-oss-20b-t1.0 | 41.57 | — | Imported | 2026-05-06 |
| 24 | gpt-oss-120b-t1.0 | 35.96 | — | Imported | 2026-05-06 |
| 25 | Qwen2.5-Coder-32B-Instruct | 35.32 | Qwen2.5 Coder 32B Instruct qwen-qwen-2.5-coder-32b-instruct | Imported | 2026-05-06 |
| 26 | Ministral-3-14B-Reasoning-2512-nothink | 26.66 | — | Imported | 2026-05-06 |
| 27 | Llama-3.1-8B-Instruct | 25.28 | Llama 3.1 8B Instruct meta-llama-llama-3.1-8b-instruct | Imported | 2026-05-06 |
| 28 | Aya-Expanse-32B | 16.90 | — | Imported | 2026-05-06 |
| 29 | Olmo-3.1-32B-Instruct | 14.63 | Olmo 3.1 32B Instruct allenai-olmo-3.1-32b-instruct | Imported | 2026-05-06 |
| 30 | EuroLLM-22B-Instruct-2512 | 13.90 | — | Imported | 2026-05-06 |
| 31 | Teuken-7B-Instruct-v0.4 | 7.02 | — | Imported | 2026-05-06 |