MMLU-Pro | BenchmarkList

Metadata

ID: mmlu_pro
Category: Intelligence
Release: 2024-06-03

Metrics

Accuracy

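Showing the 2 latest source slices: self-reported results sampled 2026-05-28 (ranks 1-6), then imported results sampled 2026-05-11 (ranks 1-345). Ranks restart within each slice.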

Rank	Subject	Accuracy	Model Match	Provenance	Sampled
1	Claude Opus 4.6 Max	89.7%	Claude Opus 4.6 anthropic-claude-opus-4.6	Self-reported	2026-05-28
2	Qwen3.7 Max	89.6%	Qwen3.7 Max qwen-qwen3.7-max	Self-reported	2026-05-28
3	Qwen3.6 Plus	88.5%	Qwen3.6 Plus qwen-qwen3.6-plus	Self-reported	2026-05-28
4	DeepSeek V4 Pro Max	87.5%	DeepSeek V4 Pro deepseek-deepseek-v4-pro	Self-reported	2026-05-28
5	Kimi K2.6 Thinking	87.1%	MoonshotAI: Kimi K2.6 moonshotai-kimi-k2.6	Self-reported	2026-05-28
6	GLM-5.1 Thinking	86.3%	GLM 5.1 z-ai-glm-5.1	Self-reported	2026-05-28
1	Gemini 3 Pro Preview (high)	89.8%	Gemini 3 google-gemini-3	Imported	2026-05-11
2	Claude Opus 4.5 (Reasoning)	89.5%	Claude Opus 4.5 anthropic-claude-opus-4.5	Imported	2026-05-11
3	Gemini 3 Pro Preview (low)	89.5%	Gemini 3 google-gemini-3	Imported	2026-05-11
4	Gemini 3 Flash Preview (Reasoning)	89%	Gemini 3 Flash Preview google-gemini-3-flash-preview	Imported	2026-05-11
5	Claude Opus 4.5 (Non-reasoning)	88.9%	Claude Opus 4.5 anthropic-claude-opus-4.5	Imported	2026-05-11
6	Gemini 3 Flash Preview (Non-reasoning)	88.2%	Gemini 3 Flash Preview google-gemini-3-flash-preview	Imported	2026-05-11
7	Claude 4.1 Opus (Reasoning)	88%	—	Imported	2026-05-11
8	Claude 4.5 Sonnet (Reasoning)	87.5%	—	Imported	2026-05-11
9	MiniMax-M2.1	87.5%	MiniMax M2.1 minimax-minimax-m2.1	Imported	2026-05-11
10	GPT-5.2 (xhigh)	87.4%	GPT-5.2 openai-gpt-5.2	Imported	2026-05-11
11	Claude 4 Opus (Reasoning)	87.3%	—	Imported	2026-05-11
12	GPT-5 (high)	87.1%	GPT-5 openai-gpt-5	Imported	2026-05-11
13	GPT-5.1 (high)	87%	GPT-5.1 openai-gpt-5.1	Imported	2026-05-11
14	GPT-5 (medium)	86.7%	GPT-5 openai-gpt-5	Imported	2026-05-11
15	Grok 4	86.6%	Grok 4 x-ai-grok-4	Imported	2026-05-11
16	GPT-5 Codex (high)	86.5%	GPT-5 Codex openai-gpt-5-codex	Imported	2026-05-11
17	DeepSeek V3.2 Speciale	86.3%	DeepSeek V3.2 Speciale deepseek-deepseek-v3.2-speciale	Imported	2026-05-11
18	DeepSeek V3.2 (Reasoning)	86.2%	DeepSeek V3.2 deepseek-deepseek-v3.2	Imported	2026-05-11
19	Gemini 2.5 Pro	86.2%	Gemini 2.5 Pro google-gemini-2.5-pro	Imported	2026-05-11
20	Claude 4 Opus (Non-reasoning)	86%	—	Imported	2026-05-11
21	Claude 4.5 Sonnet (Non-reasoning)	86%	—	Imported	2026-05-11
22	GPT-5 (low)	86%	GPT-5 openai-gpt-5	Imported	2026-05-11
23	GPT-5.1 Codex (high)	86%	GPT-5.1-Codex openai-gpt-5.1-codex	Imported	2026-05-11
24	GPT-5.2 (medium)	85.9%	GPT-5.2 openai-gpt-5.2	Imported	2026-05-11
25	Gemini 2.5 Pro Preview (Mar '25)	85.8%	Gemini 2.5 Pro Preview 06-05 google-gemini-2.5-pro-preview	Imported	2026-05-11
26	GLM-4.7 (Reasoning)	85.6%	GLM 4.7 z-ai-glm-4.7	Imported	2026-05-11
27	Doubao Seed Code	85.4%	—	Imported	2026-05-11
28	Grok 4.1 Fast (Reasoning)	85.4%	Grok 4.1 Fast x-ai-grok-4.1-fast	Imported	2026-05-11
29	o3	85.3%	o3 openai-o3	Imported	2026-05-11
30	DeepSeek V3.1 (Reasoning)	85.1%	DeepSeek V3.1 deepseek-deepseek-chat-v3.1	Imported	2026-05-11
31	DeepSeek V3.1 Terminus (Reasoning)	85.1%	DeepSeek V3.1 Terminus deepseek-deepseek-v3.1-terminus	Imported	2026-05-11
32	DeepSeek V3.2 Exp (Reasoning)	85%	DeepSeek V3.2 Exp deepseek-deepseek-v3.2-exp	Imported	2026-05-11
33	Grok 4 Fast (Reasoning)	85%	Grok 4 Fast x-ai-grok-4-fast	Imported	2026-05-11
34	Cogito v2.1 (Reasoning)	84.9%	—	Imported	2026-05-11
35	DeepSeek R1 0528 (May '25)	84.9%	R1 deepseek-r1	Imported	2026-05-11
36	Kimi K2 Thinking	84.8%	MoonshotAI: Kimi K2 Thinking moonshotai-kimi-k2-thinking	Imported	2026-05-11
37	DeepSeek R1 (Jan '25)	84.4%	R1 deepseek-r1	Imported	2026-05-11
38	MiMo-V2-Flash (Reasoning)	84.3%	MiMo-V2-Flash xiaomi-mimo-v2-flash	Imported	2026-05-11
39	Qwen3 235B A22B 2507 (Reasoning)	84.3%	Qwen3 235B A22B Instruct 2507 qwen-qwen3-235b-a22b-2507	Imported	2026-05-11
40	Claude 4 Sonnet (Reasoning)	84.2%	—	Imported	2026-05-11
41	Gemini 2.5 Flash Preview (Sep '25) (Reasoning)	84.2%	—	Imported	2026-05-11
42	o1	84.1%	o1 openai-o1	Imported	2026-05-11
43	Qwen3 Max	84.1%	Qwen3 Max qwen-qwen3-max	Imported	2026-05-11
44	K-EXAONE (Reasoning)	83.8%	—	Imported	2026-05-11
45	Qwen3 Max (Preview)	83.8%	Qwen3 Max qwen-qwen3-max	Imported	2026-05-11
46	Claude 3.7 Sonnet (Reasoning)	83.7%	Claude 3.7 Sonnet (thinking) anthropic-claude-3.7-sonnet-thinking	Imported	2026-05-11
47	Claude 4 Sonnet (Non-reasoning)	83.7%	—	Imported	2026-05-11
48	DeepSeek V3.2 (Non-reasoning)	83.7%	DeepSeek V3.2 deepseek-deepseek-v3.2	Imported	2026-05-11
49	Gemini 2.5 Pro Preview (May '25)	83.7%	Gemini 2.5 Pro Preview 06-05 google-gemini-2.5-pro-preview	Imported	2026-05-11
50	GPT-5 mini (high)	83.7%	GPT-5 Mini openai-gpt-5-mini	Imported	2026-05-11
51	DeepSeek V3.1 Terminus (Non-reasoning)	83.6%	DeepSeek V3.1 Terminus deepseek-deepseek-v3.1-terminus	Imported	2026-05-11
52	DeepSeek V3.2 Exp (Non-reasoning)	83.6%	DeepSeek V3.2 deepseek-deepseek-v3.2	Imported	2026-05-11
53	Gemini 2.5 Flash Preview (Sep '25) (Non-reasoning)	83.6%	—	Imported	2026-05-11
54	Qwen3 VL 235B A22B (Reasoning)	83.6%	—	Imported	2026-05-11
55	GLM-4.5 (Reasoning)	83.5%	GLM 4.5 z-ai-glm-4.5	Imported	2026-05-11
56	DeepSeek V3.1 (Non-reasoning)	83.3%	DeepSeek V3.1 deepseek-deepseek-chat-v3.1	Imported	2026-05-11
57	Gemini 2.5 Flash (Reasoning)	83.2%	Gemini 2.5 Flash google-gemini-2.5-flash	Imported	2026-05-11
58	o4-mini (high)	83.2%	o4 Mini openai-o4-mini	Imported	2026-05-11
59	ERNIE 5.0 Thinking Preview	83%	—	Imported	2026-05-11
60	Nova 2.0 Pro Preview (medium)	83%	—	Imported	2026-05-11
61	GLM-4.6 (Reasoning)	82.9%	GLM 4.6 z-ai-glm-4.6	Imported	2026-05-11
62	Hermes 4 - Llama-3.1 405B (Reasoning)	82.9%	—	Imported	2026-05-11
63	GPT-5 mini (medium)	82.8%	GPT-5 Mini openai-gpt-5-mini	Imported	2026-05-11
64	Grok 3 mini Reasoning (high)	82.8%	—	Imported	2026-05-11
65	Qwen3 235B A22B (Reasoning)	82.8%	Qwen3 235B A22B qwen-qwen3-235b-a22b	Imported	2026-05-11
66	Qwen3 235B A22B 2507 Instruct	82.8%	Qwen3 235B A22B Instruct 2507 qwen-qwen3-235b-a22b-2507	Imported	2026-05-11
67	Llama 3.1 Nemotron Ultra 253B v1 (Reasoning)	82.5%	—	Imported	2026-05-11
68	Kimi K2	82.4%	MoonshotAI: Kimi K2 0711 moonshotai-kimi-k2	Imported	2026-05-11
69	Qwen3 Max Thinking (Preview)	82.4%	Qwen3 Max Thinking qwen-qwen3-max-thinking	Imported	2026-05-11
70	Qwen3 Next 80B A3B (Reasoning)	82.4%	—	Imported	2026-05-11
71	Qwen3 VL 235B A22B Instruct	82.3%	Qwen3 VL 235B A22B Instruct qwen-qwen3-vl-235b-a22b-instruct	Imported	2026-05-11
72	INTELLECT-3	82.2%	INTELLECT-3 prime-intellect-intellect-3	Imported	2026-05-11
73	Ling-1T	82.2%	—	Imported	2026-05-11
74	Nova 2.0 Pro Preview (low)	82.2%	—	Imported	2026-05-11
75	GPT-5 (ChatGPT)	82%	GPT-5 openai-gpt-5	Imported	2026-05-11
76	GPT-5.1 Codex mini (high)	82%	GPT-5.1-Codex-Mini openai-gpt-5.1-codex-mini	Imported	2026-05-11
77	MiniMax-M2	82%	MiniMax M2 minimax-minimax-m2	Imported	2026-05-11
78	DeepSeek V3 0324	81.9%	DeepSeek V3 0324 deepseek-deepseek-chat-v3-0324	Imported	2026-05-11
79	Kimi K2 0905	81.9%	MoonshotAI: Kimi K2 0905 moonshotai-kimi-k2-0905	Imported	2026-05-11
80	Qwen3 Next 80B A3B Instruct	81.9%	Qwen3 Next 80B A3B Instruct qwen-qwen3-next-80b-a3b-instruct	Imported	2026-05-11
81	EXAONE 4.0 32B (Reasoning)	81.8%	—	Imported	2026-05-11
82	Nova 2.0 Lite (high)	81.8%	—	Imported	2026-05-11
83	Qwen3 VL 32B (Reasoning)	81.8%	—	Imported	2026-05-11
84	MiniMax M1 80k	81.6%	—	Imported	2026-05-11
85	GLM-4.5-Air	81.5%	GLM 4.5 Air z-ai-glm-4.5-air	Imported	2026-05-11
86	Magistral Medium 1.2	81.5%	—	Imported	2026-05-11
87	Seed-OSS-36B-Instruct	81.5%	—	Imported	2026-05-11
88	GPT-5.2 (Non-reasoning)	81.4%	GPT-5.2 openai-gpt-5.2	Imported	2026-05-11
89	Llama Nemotron Super 49B v1.5 (Reasoning)	81.4%	—	Imported	2026-05-11
90	KAT-Coder-Pro V1	81.3%	—	Imported	2026-05-11
91	Mi:dm K 2.5 Pro Preview	81.3%	—	Imported	2026-05-11
92	Nova 2.0 Lite (medium)	81.3%	—	Imported	2026-05-11
93	Hermes 4 - Llama-3.1 70B (Reasoning)	81.1%	—	Imported	2026-05-11
94	K-EXAONE (Non-reasoning)	81%	—	Imported	2026-05-11
95	Gemini 2.5 Flash (Non-reasoning)	80.9%	Gemini 2.5 Flash google-gemini-2.5-flash	Imported	2026-05-11
96	Llama 4 Maverick	80.9%	Llama 4 Maverick meta-llama-4-maverick	Imported	2026-05-11
97	Mi:dm K 2.5 Pro	80.9%	—	Imported	2026-05-11
98	Nova 2.0 Omni (medium)	80.9%	—	Imported	2026-05-11
99	Gemini 2.5 Flash-Lite Preview (Sep '25) (Reasoning)	80.8%	—	Imported	2026-05-11
100	gpt-oss-120B (high)	80.8%	gpt-oss-120b openai-gpt-oss-120b	Imported	2026-05-11
101	MiniMax M1 40k	80.8%	—	Imported	2026-05-11
102	Mistral Large 3	80.7%	—	Imported	2026-05-11
103	Qwen3 VL 30B A3B (Reasoning)	80.7%	—	Imported	2026-05-11
104	GPT-4.1	80.6%	GPT-4.1 openai-gpt-4.1	Imported	2026-05-11
105	GPT-5 (minimal)	80.6%	GPT-5 openai-gpt-5	Imported	2026-05-11
106	Ring-1T	80.6%	—	Imported	2026-05-11
107	Gemini 2.0 Pro Experimental (Feb '25)	80.5%	—	Imported	2026-05-11
108	Qwen3 30B A3B 2507 (Reasoning)	80.5%	—	Imported	2026-05-11
109	Solar Pro 2 (Reasoning)	80.5%	—	Imported	2026-05-11
110	Claude 3.7 Sonnet (Non-reasoning)	80.3%	Claude 3.7 Sonnet anthropic-claude-3.7-sonnet	Imported	2026-05-11
111	GPT-4o (March 2025, chatgpt-4o-latest)	80.3%	GPT-4o openai-gpt-4o	Imported	2026-05-11
112	o3-mini (high)	80.2%	o3 Mini High openai-o3-mini-high	Imported	2026-05-11
113	GPT-5.1 (Non-reasoning)	80.1%	GPT-5.1 openai-gpt-5.1	Imported	2026-05-11
114	Claude 4.5 Haiku (Non-reasoning)	80%	—	Imported	2026-05-11
115	Gemini 2.5 Flash Preview (Reasoning)	80%	—	Imported	2026-05-11
116	GLM-4.6V (Reasoning)	79.9%	GLM 4.6V z-ai-glm-4.6v	Imported	2026-05-11
117	Grok 3	79.9%	Grok 3 xaigrok-3	Imported	2026-05-11
118	Gemini 2.0 Flash Thinking Experimental (Jan '25)	79.8%	—	Imported	2026-05-11
119	Nova 2.0 Omni (low)	79.8%	—	Imported	2026-05-11
120	Qwen3 32B (Reasoning)	79.8%	Qwen3 32B qwen-qwen3-32b	Imported	2026-05-11
121	Gemini 2.5 Flash-Lite Preview (Sep '25) (Non-reasoning)	79.6%	Gemini 2.5 Flash Lite Preview 09-2025 google-gemini-2.5-flash-lite-preview-09-2025	Imported	2026-05-11
122	Motif-2-12.7B-Reasoning	79.6%	—	Imported	2026-05-11
123	DeepSeek R1 Distill Llama 70B	79.5%	R1 Distill Llama 70B deepseek-deepseek-r1-distill-llama-70b	Imported	2026-05-11
124	GLM-4.7 (Non-reasoning)	79.4%	GLM 4.7 z-ai-glm-4.7	Imported	2026-05-11
125	NVIDIA Nemotron 3 Nano 30B A3B (Reasoning)	79.4%	Nemotron 3 Nano 30B A3B nvidia-nemotron-3-nano-30b-a3b	Imported	2026-05-11
126	Grok Code Fast 1	79.3%	Grok Code Fast 1 x-ai-grok-code-fast-1	Imported	2026-05-11
127	Ring-flash-2.0	79.3%	—	Imported	2026-05-11
128	Qwen3 Omni 30B A3B (Reasoning)	79.2%	—	Imported	2026-05-11
129	o3-mini	79.1%	o3-mini openai-o3-mini	Imported	2026-05-11
130	Qwen3 VL 32B Instruct	79.1%	Qwen3 VL 32B Instruct qwen-qwen3-vl-32b-instruct	Imported	2026-05-11
131	Apriel-v1.6-15B-Thinker	79%	—	Imported	2026-05-11
132	GLM-4.5V (Reasoning)	78.8%	GLM 4.5V z-ai-glm-4.5v	Imported	2026-05-11
133	Nova 2.0 Lite (low)	78.8%	—	Imported	2026-05-11
134	Qwen3 Coder 480B A35B Instruct	78.8%	Qwen3 Coder 480B A35B qwen-qwen3-coder	Imported	2026-05-11
135	K2-V2 (high)	78.6%	—	Imported	2026-05-11
136	HyperCLOVA X SEED Think (32B)	78.5%	—	Imported	2026-05-11
137	Llama 3.3 Nemotron Super 49B v1 (Reasoning)	78.5%	—	Imported	2026-05-11
138	GLM-4.6 (Non-reasoning)	78.4%	GLM 4.6 z-ai-glm-4.6	Imported	2026-05-11
139	Gemini 2.5 Flash Preview (Non-reasoning)	78.3%	—	Imported	2026-05-11
140	Gemini 2.0 Flash (experimental)	78.2%	Gemini 2.0 Flash google-gemini-2.0-flash	Imported	2026-05-11
141	GPT-4.1 mini	78.1%	GPT-4.1 Mini openai-gpt-4.1-mini	Imported	2026-05-11
142	GPT-5 nano (high)	78%	GPT-5 Nano openai-gpt-5-nano	Imported	2026-05-11
143	Gemini 2.0 Flash (Feb '25)	77.9%	Gemini 2.0 Flash google-gemini-2.0-flash	Imported	2026-05-11
144	Ling-flash-2.0	77.7%	—	Imported	2026-05-11
145	Qwen3 30B A3B (Reasoning)	77.7%	Qwen3 30B A3B qwen-qwen3-30b-a3b	Imported	2026-05-11
146	Qwen3 30B A3B 2507 Instruct	77.7%	—	Imported	2026-05-11
147	ERNIE 4.5 300B A47B	77.6%	ERNIE 4.5 300B A47B baidu-ernie-4.5-300b-a47b	Imported	2026-05-11
148	GPT-5 mini (minimal)	77.5%	GPT-5 Mini openai-gpt-5-mini	Imported	2026-05-11
149	gpt-oss-120B (low)	77.5%	gpt-oss-120b openai-gpt-oss-120b	Imported	2026-05-11
150	Qwen3 14B (Reasoning)	77.4%	Qwen3 14B qwen-qwen3-14b	Imported	2026-05-11
151	Apriel-v1.5-15B-Thinker	77.3%	—	Imported	2026-05-11
152	GPT-4o (ChatGPT)	77.3%	GPT-4o openai-gpt-4o	Imported	2026-05-11
153	Claude 3.5 Sonnet (Oct '24)	77.2%	Claude 3.5 Sonnet anthropic-claude-3.5-sonnet	Imported	2026-05-11
154	GPT-5 nano (medium)	77.2%	GPT-5 Nano openai-gpt-5-nano	Imported	2026-05-11
155	Nova 2.0 Pro Preview (Non-reasoning)	77.2%	—	Imported	2026-05-11
156	EXAONE 4.0 32B (Non-reasoning)	76.8%	—	Imported	2026-05-11
157	Magistral Small 1.2	76.8%	—	Imported	2026-05-11
158	Solar Pro 2 (Preview) (Reasoning)	76.8%	—	Imported	2026-05-11
159	Qwen3 VL 30B A3B Instruct	76.4%	Qwen3 VL 30B A3B Instruct qwen-qwen3-vl-30b-a3b-instruct	Imported	2026-05-11
160	QwQ 32B	76.4%	—	Imported	2026-05-11
161	Olmo 3.1 32B Think	76.3%	—	Imported	2026-05-11
162	Devstral 2	76.2%	—	Imported	2026-05-11
163	Qwen2.5 Max	76.2%	—	Imported	2026-05-11
164	Qwen3 235B A22B (Non-reasoning)	76.2%	Qwen3 235B A22B qwen-qwen3-235b-a22b	Imported	2026-05-11
165	K2-V2 (medium)	76.1%	—	Imported	2026-05-11
166	Claude 4.5 Haiku (Reasoning)	76%	—	Imported	2026-05-11
167	Mistral Medium 3	76%	Mistral: Mistral Medium 3 mistralai-mistral-medium-3	Imported	2026-05-11
168	Gemini 2.5 Flash-Lite (Reasoning)	75.9%	Gemini 2.5 Flash Lite google-gemini-2.5-flash-lite	Imported	2026-05-11
169	NVIDIA Nemotron Nano 12B v2 VL (Reasoning)	75.9%	Nemotron Nano 12B 2 VL nvidia-nemotron-nano-12b-v2-vl	Imported	2026-05-11
170	Olmo 3 32B Think	75.9%	Olmo 3 32B Think allenai-olmo-3-32b-think	Imported	2026-05-11
171	Sonar Pro	75.5%	Sonar Pro perplexity-sonar-pro	Imported	2026-05-11
172	Magistral Medium 1	75.3%	—	Imported	2026-05-11
173	DeepSeek V3 (Dec '24)	75.2%	DeepSeek V3 deepseek-deepseek-chat	Imported	2026-05-11
174	GLM-4.6V (Non-reasoning)	75.2%	GLM 4.6V z-ai-glm-4.6v	Imported	2026-05-11
175	Llama 4 Scout	75.2%	Llama 4 Scout meta-llama-llama-4-scout	Imported	2026-05-11
176	Claude 3.5 Sonnet (June '24)	75.1%	Claude 3.5 Sonnet anthropic-claude-3.5-sonnet	Imported	2026-05-11
177	GLM-4.5V (Non-reasoning)	75.1%	GLM 4.5V z-ai-glm-4.5v	Imported	2026-05-11
178	Gemini 1.5 Pro (Sep '24)	75%	—	Imported	2026-05-11
179	Solar Pro 2 (Non-reasoning)	75%	—	Imported	2026-05-11
180	Qwen3 VL 8B (Reasoning)	74.9%	—	Imported	2026-05-11
181	GPT-4o (Nov '24)	74.8%	GPT-4o openai-gpt-4o	Imported	2026-05-11
182	gpt-oss-20B (high)	74.8%	gpt-oss-20b openai-gpt-oss-20b	Imported	2026-05-11
183	Magistral Small 1	74.6%	—	Imported	2026-05-11
184	MiMo-V2-Flash (Non-reasoning)	74.4%	MiMo-V2-Flash xiaomi-mimo-v2-flash	Imported	2026-05-11
185	Grok 4.1 Fast (Non-reasoning)	74.3%	Grok 4.1 Fast x-ai-grok-4.1-fast	Imported	2026-05-11
186	Nova 2.0 Lite (Non-reasoning)	74.3%	—	Imported	2026-05-11
187	Qwen3 4B 2507 (Reasoning)	74.3%	—	Imported	2026-05-11
188	Qwen3 8B (Reasoning)	74.3%	Qwen3 8B qwen-qwen3-8b	Imported	2026-05-11
189	NVIDIA Nemotron Nano 9B V2 (Reasoning)	74.2%	Nemotron Nano 9B V2 nvidia-nemotron-nano-9b-v2	Imported	2026-05-11
190	o1-mini	74.2%	—	Imported	2026-05-11
191	DeepSeek R1 Distill Qwen 14B	74%	—	Imported	2026-05-11
192	GPT-4o (May '24)	74%	GPT-4o (2024-05-13) openai-gpt-4o-2024-05-13	Imported	2026-05-11
193	DeepSeek R1 0528 Qwen3 8B	73.9%	—	Imported	2026-05-11
194	DeepSeek R1 Distill Qwen 32B	73.9%	R1 Distill Qwen 32B deepseek-deepseek-r1-distill-qwen-32b	Imported	2026-05-11
195	NVIDIA Nemotron Nano 9B V2 (Non-reasoning)	73.9%	Nemotron Nano 9B V2 nvidia-nemotron-nano-9b-v2	Imported	2026-05-11
196	Nova Premier	73.3%	—	Imported	2026-05-11
197	Llama 3.1 Instruct 405B	73.2%	—	Imported	2026-05-11
198	Grok 4 Fast (Non-reasoning)	73%	Grok 4 Fast x-ai-grok-4-fast	Imported	2026-05-11
199	Hermes 4 - Llama-3.1 405B (Non-reasoning)	72.9%	—	Imported	2026-05-11
200	Qwen3 32B (Non-reasoning)	72.7%	Qwen3 32B qwen-qwen3-32b	Imported	2026-05-11
201	Falcon-H1R-7B	72.5%	—	Imported	2026-05-11
202	Qwen3 Omni 30B A3B Instruct	72.5%	—	Imported	2026-05-11
203	Solar Pro 2 (Preview) (Non-reasoning)	72.5%	—	Imported	2026-05-11
204	Gemini 2.0 Flash-Lite (Feb '25)	72.4%	Gemini 2.0 Flash Lite google-gemini-2.0-flash-lite-001	Imported	2026-05-11
205	Gemini 2.5 Flash-Lite (Non-reasoning)	72.4%	Gemini 2.5 Flash Lite google-gemini-2.5-flash-lite	Imported	2026-05-11
206	Qwen2.5 Instruct 72B	72%	Qwen2.5 72B Instruct qwen-qwen-2.5-72b-instruct	Imported	2026-05-11
207	Nova 2.0 Omni (Non-reasoning)	71.9%	—	Imported	2026-05-11
208	gpt-oss-20B (low)	71.8%	gpt-oss-20b openai-gpt-oss-20b	Imported	2026-05-11
209	Llama 3.1 Tulu3 405B	71.6%	—	Imported	2026-05-11
210	Phi-4	71.4%	Phi 4 microsoft-phi-4	Imported	2026-05-11
211	K2-V2 (low)	71.3%	—	Imported	2026-05-11
212	Llama 3.3 Instruct 70B	71.3%	—	Imported	2026-05-11
213	Command A	71.2%	Command A cohere-command-a	Imported	2026-05-11
214	Qwen3 30B A3B (Non-reasoning)	71%	Qwen3 30B A3B qwen-qwen3-30b-a3b	Imported	2026-05-11
215	Grok 2 (Dec '24)	70.9%	—	Imported	2026-05-11
216	Devstral Medium	70.8%	Mistral: Devstral Medium mistralai-devstral-medium	Imported	2026-05-11
217	Qwen3 Coder 30B A3B Instruct	70.6%	Qwen3 Coder 30B A3B Instruct qwen-qwen3-coder-30b-a3b-instruct	Imported	2026-05-11
218	Grok Beta	70.3%	—	Imported	2026-05-11
219	Pixtral Large	70.1%	Mistral: Pixtral Large 2411 mistralai-pixtral-large-2411	Imported	2026-05-11
220	Qwen3 VL 4B (Reasoning)	70%	—	Imported	2026-05-11
221	Llama 3.3 Nemotron Super 49B v1 (Non-reasoning)	69.8%	—	Imported	2026-05-11
222	Mistral Large 2 (Nov '24)	69.7%	—	Imported	2026-05-11
223	Qwen2.5 Instruct 32B	69.7%	—	Imported	2026-05-11
224	Claude 3 Opus	69.6%	—	Imported	2026-05-11
225	Qwen3 4B (Reasoning)	69.6%	—	Imported	2026-05-11
226	Sarvam M (Reasoning)	69.6%	—	Imported	2026-05-11
227	GPT-4 Turbo	69.4%	GPT-4 Turbo openai-gpt-4-turbo	Imported	2026-05-11
228	Ministral 3 14B	69.3%	—	Imported	2026-05-11
229	Llama Nemotron Super 49B v1.5 (Non-reasoning)	69.2%	—	Imported	2026-05-11
230	Nova Pro	69.1%	Nova Pro 1.0 amazon-nova-pro-v1	Imported	2026-05-11
231	Llama 3.1 Nemotron Instruct 70B	69%	—	Imported	2026-05-11
232	Sonar	68.9%	Sonar perplexity-sonar	Imported	2026-05-11
233	Qwen3 VL 8B Instruct	68.6%	Qwen3 VL 8B Instruct qwen-qwen3-vl-8b-instruct	Imported	2026-05-11
234	Mistral Large 2 (Jul '24)	68.3%	Mistral Large 2407 mistralai-mistral-large-2407	Imported	2026-05-11
235	Mistral Medium 3.1	68.3%	Mistral: Mistral Medium 3.1 mistralai-mistral-medium-3.1	Imported	2026-05-11
236	Mistral Small 3.2	68.1%	—	Imported	2026-05-11
237	Gemini 1.5 Flash (Sep '24)	68%	—	Imported	2026-05-11
238	Devstral Small 2	67.8%	—	Imported	2026-05-11
239	Llama 3.1 Instruct 70B	67.6%	—	Imported	2026-05-11
240	Qwen3 14B (Non-reasoning)	67.5%	Qwen3 14B qwen-qwen3-14b	Imported	2026-05-11
241	Qwen3 4B 2507 Instruct	67.2%	—	Imported	2026-05-11
242	Ling-mini-2.0	67.1%	—	Imported	2026-05-11
243	Llama 3.2 Instruct 90B (Vision)	67.1%	—	Imported	2026-05-11
244	Gemma 3 27B Instruct	66.9%	Gemma 3 27B google-gemma-3-27b-it	Imported	2026-05-11
245	Reka Flash 3	66.9%	Reka Flash 3 rekaai-reka-flash-3	Imported	2026-05-11
246	Hermes 4 - Llama-3.1 70B (Non-reasoning)	66.4%	—	Imported	2026-05-11
247	Mistral Small 3.1	65.9%	—	Imported	2026-05-11
248	Gemini 1.5 Pro (May '24)	65.7%	—	Imported	2026-05-11
249	GPT-4.1 nano	65.7%	GPT-4.1 Nano openai-gpt-4.1-nano	Imported	2026-05-11
250	Olmo 3 7B Think	65.5%	—	Imported	2026-05-11
251	Mistral Small 3	65.2%	—	Imported	2026-05-11
252	NVIDIA Nemotron Nano 12B v2 VL (Non-reasoning)	64.9%	Nemotron Nano 12B 2 VL nvidia-nemotron-nano-12b-v2-vl	Imported	2026-05-11
253	GPT-4o mini	64.8%	GPT-4o-mini openai-gpt-4o-mini	Imported	2026-05-11
254	QwQ 32B-Preview	64.8%	—	Imported	2026-05-11
255	Qwen3 8B (Non-reasoning)	64.3%	Qwen3 8B qwen-qwen3-8b	Imported	2026-05-11
256	Ministral 3 8B	64.2%	—	Imported	2026-05-11
257	Qwen2.5 Coder Instruct 32B	63.5%	Qwen2.5 Coder 32B Instruct qwen-qwen-2.5-coder-32b-instruct	Imported	2026-05-11
258	Claude 3.5 Haiku	63.4%	Claude 3.5 Haiku anthropic-claude-3.5-haiku	Imported	2026-05-11
259	Qwen3 VL 4B Instruct	63.4%	—	Imported	2026-05-11
260	Qwen2.5 Turbo	63.3%	Qwen-Turbo qwen-qwen-turbo	Imported	2026-05-11
261	Devstral Small (May '25)	63.2%	Mistral: Devstral Small 1.1 mistralai-devstral-small	Imported	2026-05-11
262	Granite 4.0 H Small	62.4%	—	Imported	2026-05-11
263	Devstral Small (Jul '25)	62.2%	Mistral: Devstral Small 1.1 mistralai-devstral-small	Imported	2026-05-11
264	Qwen2 Instruct 72B	62.2%	—	Imported	2026-05-11
265	Mistral Saba	61.1%	Mistral: Saba mistralai-mistral-saba	Imported	2026-05-11
266	Gemma 3 12B Instruct	59.5%	Gemma 3 12B google-gemma-3-12b-it	Imported	2026-05-11
267	Nova Lite	59%	Nova Lite 1.0 amazon-nova-lite-v1	Imported	2026-05-11
268	Exaone 4.0 1.2B (Reasoning)	58.8%	—	Imported	2026-05-11
269	Qwen3 4B (Non-reasoning)	58.6%	—	Imported	2026-05-11
270	Kimi Linear 48B A3B Instruct	58.5%	—	Imported	2026-05-11
271	DeepHermes 3 - Mistral 24B Preview (Non-reasoning)	58%	—	Imported	2026-05-11
272	Claude 3 Sonnet	57.9%	—	Imported	2026-05-11
273	NVIDIA Nemotron 3 Nano 30B A3B (Non-reasoning)	57.9%	Nemotron 3 Nano 30B A3B nvidia-nemotron-3-nano-30b-a3b	Imported	2026-05-11
274	Jamba 1.7 Large	57.7%	—	Imported	2026-05-11
275	Jamba Reasoning 3B	57.7%	—	Imported	2026-05-11
276	Gemini 1.5 Flash (May '24)	57.4%	—	Imported	2026-05-11
277	Llama 3 Instruct 70B	57.4%	—	Imported	2026-05-11
278	Jamba 1.5 Large	57.2%	—	Imported	2026-05-11
279	Hermes 3 - Llama-3.1 70B	57.1%	Hermes 3 70B Instruct nousresearch-hermes-3-llama-3.1-70b	Imported	2026-05-11
280	Qwen3 1.7B (Reasoning)	57%	—	Imported	2026-05-11
281	Gemini 1.5 Flash-8B	56.9%	—	Imported	2026-05-11
282	Jamba 1.6 Large	56.5%	—	Imported	2026-05-11
283	GPT-5 nano (minimal)	55.6%	GPT-5 Nano openai-gpt-5-nano	Imported	2026-05-11
284	Llama 3.1 Nemotron Nano 4B v1.1 (Reasoning)	55.6%	—	Imported	2026-05-11
285	DeepSeek R1 Distill Llama 8B	54.3%	—	Imported	2026-05-11
286	Mixtral 8x22B Instruct	53.7%	Mistral: Mixtral 8x22B Instruct mistralai-mixtral-8x22b-instruct	Imported	2026-05-11
287	Nova Micro	53.1%	Nova Micro 1.0 amazon-nova-micro-v1	Imported	2026-05-11
288	Mistral Small (Sep '24)	52.9%	—	Imported	2026-05-11
289	Ministral 3 3B	52.4%	—	Imported	2026-05-11
290	Olmo 3 7B Instruct	52.2%	—	Imported	2026-05-11
291	Mistral Large (Feb '24)	51.5%	Mistral Large mistralai-mistral-large	Imported	2026-05-11
292	OLMo 2 32B	51.1%	—	Imported	2026-05-11
293	LFM2 8B A1B	50.5%	—	Imported	2026-05-11
294	Exaone 4.0 1.2B (Non-reasoning)	50%	—	Imported	2026-05-11
295	Claude 2.1	49.5%	—	Imported	2026-05-11
296	Mistral Medium	49.1%	—	Imported	2026-05-11
297	Gemma 3n E4B Instruct	48.8%	—	Imported	2026-05-11
298	Claude 2.0	48.6%	—	Imported	2026-05-11
299	Phi-4 Multimodal Instruct	48.5%	—	Imported	2026-05-11
300	Gemma 3n E4B Instruct Preview (May '25)	48.3%	—	Imported	2026-05-11
301	Llama 3.1 Instruct 8B	47.6%	—	Imported	2026-05-11
302	Qwen2.5 Coder Instruct 7B	47.3%	—	Imported	2026-05-11
303	Granite 3.3 8B (Non-reasoning)	46.8%	—	Imported	2026-05-11
304	Phi-4 Mini Instruct	46.5%	—	Imported	2026-05-11
305	Llama 3.2 Instruct 11B (Vision)	46.4%	—	Imported	2026-05-11
306	GPT-3.5 Turbo	46.2%	GPT-3.5 Turbo openai-gpt-3.5-turbo	Imported	2026-05-11
307	Granite 4.0 Micro	44.7%	Granite 4.0 Micro ibm-granite-granite-4.0-h-micro	Imported	2026-05-11
308	Phi-3 Mini Instruct 3.8B	43.5%	—	Imported	2026-05-11
309	Claude Instant	43.4%	—	Imported	2026-05-11
310	Command-R+ (Apr '24)	43.2%	Command R (08-2024) cohere-command-r-08-2024	Imported	2026-05-11
311	Gemini 1.0 Pro	43.1%	—	Imported	2026-05-11
312	DeepSeek Coder V2 Lite Instruct	42.9%	—	Imported	2026-05-11
313	LFM 40B	42.5%	—	Imported	2026-05-11
314	Mistral Small (Feb '24)	41.9%	—	Imported	2026-05-11
315	Gemma 3 4B Instruct	41.7%	Gemma 3 4B google-gemma-3-4b-it	Imported	2026-05-11
316	Qwen3 1.7B (Non-reasoning)	41.1%	—	Imported	2026-05-11
317	Llama 2 Chat 13B	40.6%	—	Imported	2026-05-11
318	Llama 2 Chat 70B	40.6%	—	Imported	2026-05-11
319	Llama 3 Instruct 8B	40.5%	—	Imported	2026-05-11
320	DBRX Instruct	39.7%	—	Imported	2026-05-11
321	Jamba 1.7 Mini	38.8%	—	Imported	2026-05-11
322	Mixtral 8x7B Instruct	38.7%	Mistral: Mixtral 8x7B Instruct mistralai-mixtral-8x7b-instruct	Imported	2026-05-11
323	Gemma 3n E2B Instruct	37.8%	—	Imported	2026-05-11
324	Jamba 1.5 Mini	37.1%	—	Imported	2026-05-11
325	Molmo 7B-D	37.1%	—	Imported	2026-05-11
326	Jamba 1.6 Mini	36.7%	—	Imported	2026-05-11
327	DeepHermes 3 - Llama-3.1 8B Preview (Non-reasoning)	36.5%	—	Imported	2026-05-11
328	Llama 3.2 Instruct 3B	34.7%	—	Imported	2026-05-11
329	Qwen3 0.6B (Reasoning)	34.7%	—	Imported	2026-05-11
330	Command-R (Mar '24)	33.8%	Command R (08-2024) cohere-command-r-08-2024	Imported	2026-05-11
331	Granite 4.0 1B	32.5%	—	Imported	2026-05-11
332	OpenChat 3.5 (1210)	31%	—	Imported	2026-05-11
333	LFM2 2.6B	29.8%	—	Imported	2026-05-11
334	OLMo 2 7B	28.2%	—	Imported	2026-05-11
335	Granite 4.0 H 1B	27.7%	—	Imported	2026-05-11
336	DeepSeek R1 Distill Qwen 1.5B	26.9%	—	Imported	2026-05-11
337	LFM2 1.2B	25.7%	—	Imported	2026-05-11
338	Mistral 7B Instruct	24.5%	—	Imported	2026-05-11
339	Qwen3 0.6B (Non-reasoning)	23.1%	—	Imported	2026-05-11
340	Llama 3.2 Instruct 1B	20%	—	Imported	2026-05-11
341	Llama 2 Chat 7B	16.4%	—	Imported	2026-05-11
342	Gemma 3 1B Instruct	13.5%	—	Imported	2026-05-11
343	Granite 4.0 H 350M	12.7%	—	Imported	2026-05-11
344	Granite 4.0 350M	12.4%	—	Imported	2026-05-11
345	Gemma 3 270M	5.5%	—	Imported	2026-05-11
