Clembench Text v3.0 | BenchmarkList

Metadata

ID: clembench_text_v3
Category: Agentic
Release: 2024-01-01

Metrics

clemscore, adventuregame % Played, adventuregame Quality Score, all Average % Played, all Average Quality Score, clean_up % Played, clean_up Quality Score, codenames % Played, codenames Quality Score, dond % Played, dond Quality Score, guesswhat % Played, guesswhat Quality Score, hot_air_balloon % Played, hot_air_balloon Quality Score, imagegame % Played, imagegame Quality Score, matchit_ascii % Played, matchit_ascii Quality Score, privateshared % Played, privateshared Quality Score, referencegame % Played, referencegame Quality Score, taboo % Played, taboo Quality Score, textmapworld % Played, textmapworld Quality Score, textmapworld_graphreasoning % Played, textmapworld_graphreasoning Quality Score, textmapworld_specificroom % Played, textmapworld_specificroom Quality Score, wordle % Played, wordle Quality Score, wordle_withclue % Played, wordle_withclue Quality Score, wordle_withcritic % Played, wordle_withcritic Quality Score

Latest Results

Rank	Subject	clemscore	Model Match	Provenance	Sampled
1	claude-sonnet-4-5-azure-high-t1.0	90.10	—	Imported	2026-05-06
2	claude-sonnet-4-5-20250929-t1.0	87.42	—	Imported	2026-05-06
3	claude-sonnet-4-5-azure-low-t1.0	86.01	—	Imported	2026-05-06
4	gpt-5.2-azure-high-t1.0	84.19	GPT-5.2 openai-gpt-5.2	Imported	2026-05-06
5	gemini-3-flash-t1.0	84.03	—	Imported	2026-05-06
6	gpt-5.2-2025-12-11-t1.0	81.66	GPT-5.2 openai-gpt-5.2	Imported	2026-05-06
7	gpt-5.2-azure-medium-t1.0	79.61	GPT-5.2 openai-gpt-5.2	Imported	2026-05-06
8	glm-4.7-t1.0	78.05	—	Imported	2026-05-06
9	kimi-k2-thinking-t1.0	77.79	KIMI MoonshotAI: Kimi K2 Thinking moonshotai-kimi-k2-thinking	Imported	2026-05-06
10	gpt-5.2-azure-minimal-t1.0	74.27	GPT-5.2 openai-gpt-5.2	Imported	2026-05-06
11	glm-4.6-t1.0	63.91	—	Imported	2026-05-06
12	kimi-k2.5-without-reasoning-t1.0	60.28	—	Imported	2026-05-06
13	qwen3-max-t1.0	59.66	—	Imported	2026-05-06
14	deepseek-v3.2-t1.0	59.61	—	Imported	2026-05-06
15	glm-5-without-reasoning-t1.0	58.68	—	Imported	2026-05-06
16	minimax-m2.5-t1.0	55.68	—	Imported	2026-05-06
17	deepseek-v3.2-without-reasoning-t1.0	52.94	—	Imported	2026-05-06
18	Llama-3.3-70B-Instruct	50.00	Llama 3.3 70B Instruct meta-llama-llama-3.3-70b-instruct	Imported	2026-05-06
19	Qwen2.5-72B-Instruct	48.07	Qwen2.5 72B Instruct qwen-qwen-2.5-72b-instruct	Imported	2026-05-06
20	Llama-3.1-70B-Instruct	46.80	Llama 3.1 70B Instruct meta-llama-llama-3.1-70b-instruct	Imported	2026-05-06
21	Qwen3-Next-80B-A3B-Instruct	45.24	Qwen3 Next 80B A3B Instruct qwen-qwen3-next-80b-a3b-instruct	Imported	2026-05-06
22	mistral-3-large-2512-t1.0	44.79	—	Imported	2026-05-06
23	gpt-oss-20b-t1.0	41.57	—	Imported	2026-05-06
24	gpt-oss-120b-t1.0	35.96	—	Imported	2026-05-06
25	Qwen2.5-Coder-32B-Instruct	35.32	Qwen2.5 Coder 32B Instruct qwen-qwen-2.5-coder-32b-instruct	Imported	2026-05-06
26	Ministral-3-14B-Reasoning-2512-nothink	26.66	—	Imported	2026-05-06
27	Llama-3.1-8B-Instruct	25.28	Llama 3.1 8B Instruct meta-llama-llama-3.1-8b-instruct	Imported	2026-05-06
28	Aya-Expanse-32B	16.90	—	Imported	2026-05-06
29	Olmo-3.1-32B-Instruct	14.63	OLMO Olmo 3.1 32B Instruct allenai-olmo-3.1-32b-instruct	Imported	2026-05-06
30	EuroLLM-22B-Instruct-2512	13.90	—	Imported	2026-05-06
31	Teuken-7B-Instruct-v0.4	7.02	—	Imported	2026-05-06
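
The table ranks subjects by clemscore, while the metric list also carries per-game % Played and Quality Score values plus their "all Average" aggregates. This page does not state how these relate. The sketch below assumes the convention described by the clembench authors, where clemscore is the macro-averaged % Played (as a fraction) multiplied by the macro-averaged Quality Score. The function name, the dictionary layout, and the numbers are hypothetical illustrations, not values from this leaderboard.

```python
from statistics import mean


def clemscore(played_by_game, quality_by_game):
    """Combine per-game metrics into a single score.

    Assumption (not stated on this page): clemscore is
    (average % Played / 100) * average Quality Score,
    with both averages taken over games (macro-average).
    """
    avg_played = mean(played_by_game.values())    # "all Average % Played"
    avg_quality = mean(quality_by_game.values())  # "all Average Quality Score"
    return avg_played * avg_quality / 100


# Invented example values for two games; not taken from the table above.
example_played = {"taboo": 100.0, "wordle": 60.0}
example_quality = {"taboo": 70.0, "wordle": 30.0}

print(clemscore(example_played, example_quality))  # → 40.0
```

Under this assumption, a subject that plays few games to completion scores low even if the games it does finish are of high quality, which is one possible reason % Played and Quality Score are reported separately.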