Benchmark List: General

Default general-purpose frontier-model leaderboard basket across reasoning, coding, writing, agentic, multimodal, and structured-output benchmarks.

123 models
35 benchmarks

Ranked Models

# Model Avg Rank HLE GD BenchLM MMLU-Redux LiveBench MMLU-ProX ABL Multi-IF CWV BZ WritingBench AHV EQ-Bench ALE-Bench SBP LiveCodeBench BH NL2Repo GLR ARC-AGI-2 PinchBench BFC MCPMark AutoBench SO EG OSWorld-Verified LW TA AutomationBench LiveSQLBench VideoMMMU MMMU-Pro CC-OCR MATH-500
1 Claude Mythos Preview Claude 1.0
5/35 rows
#1 #1 #1 #1 #1
2 Claude Opus 4.8 Claude 1.4
5/35 rows
#1 #3 #1 #1 #1
3 Qwen3.7 Max Qwen 1.4
8/35 rows
#1 #1 #3 #1 #1 #2 #2 #1
4 GPT-5.5 GPT 2.1
13/35 rows
#2 #2 #3 #1 #1 #2 #1 #1 #7 #1 #2 #3 #1
5 Claude Opus 4.7 Claude 2.5
10/35 rows
#2 #2 #5 #1 #2 #4 #1 #4 #2 #3
6 Claude Opus 4.6 Claude 4.3
19/35 rows
#2 #2 #10 #2 #2 #1 #5 #4 #1 #6 #14 #1 #4 #2 #12 #1 #4 #4 #2
7 Qwen3 235B A22B Thinking 2507 Qwen 4.4
7/35 rows
#5 #5 #1 #5 #1 #3 #10
8 GPT-5.4 GPT 5.1
15/35 rows
#4 #5 #8 #2 #23 #3 #3 #8 #5 #3 #6 #1 #3 #8 #2
9 Gemini 3.1 Pro Preview Gemini 5.7
17/35 rows
#1 #1 #2 #3 #4 #19 #4 #24 #3 #18 #3 #2 #4 #4 #4 #1 #3
10 MoonshotAI: Kimi K2.6 Kimi 6.3
14/35 rows
#4 #3 #12 #1 #23 #6 #22 #2 #3 #3 #5 #5 #5 #5
11 Qwen3 VL 235B A22B Thinking Qwen 6.4
7/35 rows
#6 #7 #3 #6 #3 #15 #5
12 DeepSeek V4 Pro DeepSeek 7.4
13/35 rows
#3 #5 #9 #4 #13 #4 #26 #3 #1 #5 #15 #3 #13
13 Qwen3 Next 80B A3B Thinking Qwen 9.7
6/35 rows
#15 #10 #5 #9 #10 #9
14 GLM 5.1 GLM 9.7
15/35 rows
#5 #6 #14 #6 #29 #5 #35 #4 #1 #11 #29 #2 #5 #3 #6
15 GPT-5.3-Codex Codex 10.5
9/35 rows
#6 #6 #11 #18 #14 #2 #22 #7 #9
16 GPT-5.2 GPT 12.3
14/35 rows
#13 #13 #19 #11 #3 #10 #23 #28 #16 #1 #6 #9 #4 #7
17 GPT-5.4 Pro GPT 13.1
5/35 rows
#1 #1 #4 #2 #67
18 Qwen3.6 Plus Qwen 13.1
17/35 rows
#6 #4 #29 #2 #28 #1 #52 #6 #5 #3 #17 #61 #6 #8 #8 #8 #1
19 Qwen3 VL 30B A3B Thinking Qwen 13.4
8/35 rows
#20 #15 #12 #11 #7 #13 #17 #13
20 Qwen3 VL 8B Thinking Qwen 14.7
8/35 rows
#25 #19 #9 #12 #5 #14 #19 #16
21 Gemini 3 Gemini 15.2
13/35 rows
#9 #11 #18 #27 #17 #14 #35 #56 #3 #2 #9 #2 #1
22 Claude Opus 4.5 Claude 15.4
10/35 rows
#29 #39 #23 #7 #16 #1 #7 #3 #6 #18
23 Claude Sonnet 4.6 Claude 15.6
10/35 rows
#24 #29 #15 #20 #9 #23 #13 #4 #11 #2
24 GPT-5.2-Codex Codex 16.2
6/35 rows
#18 #15 #24 #15 #9 #16
25 Gemini 3 Flash Preview Gemini 16.6
13/35 rows
#15 #16 #40 #19 #6 #6 #18 #33 #17 #13 #22 #5 #2
26 Claude Sonnet 4.5 Claude 17.4
11/35 rows
#37 #22 #13 #46 #11 #2 #9 #7 #13 #12 #23
27 Qwen3.5 397B A17B Qwen 18.7
8/35 rows
#35 #19 #1 #1 #1 #58 #27 #7
28 GPT-5 GPT 20.6
13/35 rows
#38 #51 #22 #18 #50 #3 #16 #8 #3 #8 #11 #6 #11
29 Qwen3.5-122B-A10B Qwen 21.7
11/35 rows
#55 #48 #39 #4 #3 #2 #25 #17 #9 #13 #4
30 MiMo-V2.5-Pro Xiaomi 22.0
4/35 rows
#16 #40 #34 #4
31 Grok 4.3 Grok 22.1
5/35 rows
#14 #14 #41 #30 #20
32 MoonshotAI: Kimi K2.5 Kimi 22.4
13/35 rows
#26 #27 #25 #32 #2 #37 #2 #37 #49 #28 #9 #11 #3
33 Grok 4.20 Grok 24.0
9/35 rows
#20 #8 #38 #34 #20 #39 #20 #32 #11
34 Claude Haiku 4.5 Claude 26.6
7/35 rows
#52 #28 #75 #6 #6 #12 #7
35 DeepSeek V4 Flash DeepSeek 27.3
6/35 rows
#21 #18 #26 #40 #50 #21
36 Qwen3.5-27B Qwen 27.8
13/35 rows
#62 #46 #45 #11 #3 #32 #9 #75 #38 #4 #10 #12 #7
37 GPT-5.1 GPT 28.1
10/35 rows
#39 #32 #21 #21 #31 #15 #47 #43 #3 #10
38 MiniMax M2.7 MiniMax 28.9
10/35 rows
#31 #31 #46 #45 #61 #2 #50 #5 #10 #13
39 MiMo-V2-Pro Xiaomi 29.0
6/35 rows
#30 #35 #45 #41 #15 #7
40 GPT-5.4 Mini GPT 29.5
8/35 rows
#37 #30 #39 #16 #41 #48 #15 #6
41 GLM 4.7 GLM 30.8
9/35 rows
#49 #45 #34 #5 #72 #35 #14 #5 #22
42 Claude Sonnet 4 Claude 31.2
5/35 rows
#20 #62 #38 #15 #16
43 Grok 4 Grok 31.5
7/35 rows
#52 #28 #42 #31 #44 #9 #10
44 o3 o-series 31.9
11/35 rows
#69 #79 #53 #2 #58 #8 #18 #6 #15 #11 #14
45 Qwen3.6 27B Qwen 32.3
8/35 rows
#63 #64 #28 #7 #4 #45 #7 #6
46 Qwen3.6 35B A3B Qwen 33.7
8/35 rows
#67 #65 #36 #9 #5 #42 #9 #3
47 Qwen3.5-35B-A3B Qwen 34.5
13/35 rows
#72 #60 #56 #9 #5 #3 #85 #52 #43 #18 #11 #14 #8
48 DeepSeek V3.2 DeepSeek 36.2
13/35 rows
#60 #67 #47 #48 #16 #6 #9 #43 #76 #30 #8 #29 #12
49 MoonshotAI: Kimi K2 Thinking Kimi 37.4
8/35 rows
#59 #69 #3 #8 #15 #62 #49 #19
50 Grok 4.1 Fast Grok 37.9
10/35 rows
#82 #52 #33 #34 #73 #29 #34 #5 #16 #12
51 Gemma 4 31B Gemma 38.0
9/35 rows
#56 #47 #50 #33 #44 #47 #21 #21 #13
52 GPT-5.1-Codex Codex 38.0
5/35 rows
#54 #44 #31 #12 #34
53 R1 0528 DeepSeek 38.1
5/35 rows
#8 #13 #39 #5 #115
54 GLM 5 GLM 38.5
10/35 rows
#36 #86 #17 #36 #7 #46 #26 #69 #20 #15
55 Qwen3 Max Thinking Qwen 38.9
4/35 rows
#41 #42 #33 #39
56 GPT-5 Mini GPT 39.9
11/35 rows
#71 #77 #42 #40 #71 #40 #17 #11 #20 #18 #10
57 Qwen3 235B A22B Instruct 2507 Qwen 40.8
11/35 rows
#95 #112 #12 #8 #6 #3 #7 #4 #111 #23 #20
58 MiMo-V2-Flash Xiaomi 41.8
6/35 rows
#65 #58 #48 #1 #48 #8
59 GPT-5.4 Nano GPT 41.8
8/35 rows
#40 #88 #26 #27 #64 #42 #23 #12
60 GLM 5 Turbo GLM 44.8
4/35 rows
#47 #56 #56 #19
61 MiniMax M2.5 MiniMax 45.0
8/35 rows
#74 #55 #5 #59 #36 #68 #14 #22
62 Nemotron 3 Super Nemotron 45.7
9/35 rows
#73 #101 #9 #6 #86 #60 #10 #20 #15
63 Gemini 2.5 Pro Gemini 47.4
8/35 rows
#64 #61 #41 #30 #67 #53 #29 #20
64 MiniMax M2.1 MiniMax 47.8
5/35 rows
#61 #75 #57 #12 #23
65 MiMo-V2.5 Xiaomi 48.3
4/35 rows
#48 #53 #68 #32
66 o4 Mini o-series 49.2
7/35 rows
#83 #118 #3 #61 #21 #26 #14
67 Step 3.5 Flash StepFun 50.8
4/35 rows
#57 #74 #33 #26
68 gpt-oss-120b GPT 51.3
11/35 rows
#78 #121 #84 #15 #64 #8 #54 #59 #36 #24 #14
69 Gemma 4 26B A4B Gemma 51.6
8/35 rows
#80 #109 #32 #53 #31 #31 #23 #20
70 Gemini 3.1 Flash Lite Preview Gemini 54.0
5/35 rows
#89 #85 #48 #19 #5
71 R1 DeepSeek 54.6
9/35 rows
#96 #90 #86 #12 #6 #14 #107 #7 #18
72 Qwen3 235B A22B Qwen 57.5
9/35 rows
#129 #203 #68 #28 #7 #13 #9 #22 #17
73 Qwen3 Next 80B A3B Instruct Qwen 58.9
9/35 rows
#202 #169 #20 #13 #8 #8 #2 #2 #18
74 MoonshotAI: Kimi K2 0711 Kimi 59.0
9/35 rows
#213 #141 #75 #13 #6 #11 #11 #13 #6
75 Qwen3.5-9B Qwen 59.9
6/35 rows
#108 #99 #19 #14 #10 #63
76 Grok 4 Fast Grok 60.6
4/35 rows
#87 #57 #65 #20
77 Gemini 2.5 Flash Gemini 61.7
9/35 rows
#135 #111 #80 #85 #57 #15 #31 #8 #8
78 Qwen3 VL 235B A22B Instruct Qwen 61.8
10/35 rows
#235 #195 #16 #11 #7 #4 #5 #5 #18 #2
79 GLM 4.6 GLM 62.8
5/35 rows
#106 #122 #77 #12 #4
80 DeepSeek V3.2 Exp DeepSeek 62.8
4/35 rows
#103 #104 #12 #14
81 Qwen3 VL 30B A3B Instruct Qwen 68.3
10/35 rows
#230 #211 #27 #18 #18 #9 #13 #12 #21 #8
82 GLM 5V Turbo GLM 70.3
4/35 rows
#91 #97 #56 #24
83 Mercury 2 Inception 72.3
5/35 rows
#92 #134 #43 #44 #16
84 o3-mini o-series 74.9
8/35 rows
#176 #163 #57 #2 #15 #1 #91 #10
85 Qwen3 VL 32B Instruct Qwen 75.0
9/35 rows
#236 #231 #22 #16 #14 #7 #12 #9 #10
86 DeepSeek V3.1 DeepSeek 75.9
5/35 rows
#112 #123 #90 #18 #10
87 GLM 4.5 GLM 76.2
6/35 rows
#123 #120 #93 #76 #30 #3
88 Nemotron 3 Nano 30B A3B Nemotron 76.9
7/35 rows
#149 #153 #97 #22 #8 #25 #17
89 DeepSeek V3.1 Terminus DeepSeek 78.3
5/35 rows
#94 #108 #96 #47 #28
90 MoonshotAI: Kimi K2 0905 Kimi 82.4
7/35 rows
#233 #139 #13 #81 #21 #13 #6
91 Trinity Large Thinking Arcee 83.8
4/35 rows
#99 #156 #59 #2
92 o3 Mini High o-series 87.0
4/35 rows
#122 #130 #9 #80
93 GPT-5 Nano GPT 88.8
9/35 rows
#183 #225 #39 #49 #84 #58 #24 #34 #17
94 gpt-oss-20b GPT 94.0
8/35 rows
#157 #214 #109 #25 #65 #60 #28 #28
95 Qwen3 Coder Next Qwen 97.4
4/35 rows
#168 #171 #3 #41
96 Mistral: Mistral Small 4 Mistral 98.5
5/35 rows
#166 #135 #69 #46 #26
97 GPT-4.1 GPT 102.3
11/35 rows
#345 #236 #51 #15 #66 #7 #58 #126 #20 #33 #15
98 Qwen3 Max Qwen 103.6
4/35 rows
#137 #144 #74 #25
99 GLM 4.5 Air GLM 113.2
5/35 rows
#217 #174 #108 #22 #4
100 Llama 4 Maverick Llama 116.1
10/35 rows
#326 #230 #111 #19 #88 #13 #133 #62 #32 #28
101 Grok Code Fast 1 Grok 123.7
5/35 rows
#198 #179 #78 #63 #23
102 DeepSeek V3 DeepSeek 124.3
10/35 rows
#450 #310 #82 #24 #27 #21 #54 #27 #24 #27
103 o1 o-series 127.5
4/35 rows
#196 #164 #55 #2
104 GPT-4.1 Mini GPT 129.6
8/35 rows
#346 #238 #70 #17 #8 #136 #27 #38
105 Gemma 3 27B Gemma 140.5
7/35 rows
#334 #378 #110 #10 #69 #17 #16
106 GPT-4o GPT 144.0
6/35 rows
#303 #248 #74 #55 #26 #35
107 Claude 3.7 Sonnet Claude 144.5
5/35 rows
#322 #245 #4 #21 #14
108 Qwen3 VL 8B Instruct Qwen 151.8
8/35 rows
#475 #381 #31 #20 #9 #11 #22 #11
109 Llama 4 Scout Llama 157.8
8/35 rows
#378 #290 #106 #26 #134 #68 #72 #27
110 Gemini 2.5 Flash Lite Gemini 159.8
4/35 rows
#228 #261 #66 #52
111 GPT-4.1 Nano GPT 165.6
8/35 rows
#425 #339 #95 #20 #22 #135 #58 #39
112 GPT-4o (2024-08-06) GPT 173.2
6/35 rows
#474 #329 #19 #24 #18 #23
113 Gemini 2.0 Flash Gemini 173.8
4/35 rows
#269 #255 #106 #14
114 Phi 4 Phi 177.7
6/35 rows
#408 #298 #92 #36 #70 #25
115 Qwen3 Coder 480B A35B Qwen 190.4
4/35 rows
#374 #266 #70 #4
116 Claude 3.5 Sonnet Claude 192.3
5/35 rows
#422 #280 #77 #23 #36
117 GPT-4 Turbo GPT 192.5
4/35 rows
#467 #98 #25 #17
118 Gemma 3 12B Gemma 221.3
4/35 rows
#324 #416 #66 #18
119 GPT-4o-mini GPT 226.0
5/35 rows
#412 #382 #63 #50 #50
120 Gemma 3 4B Gemma 226.7
4/35 rows
#276 #453 #101 #27
121 Qwen2.5 72B Instruct Qwen 246.7
4/35 rows
#398 #349 #64 #29
122 Claude 3 Haiku Claude 263.5
4/35 rows
#421 #407 #102 #28
123 Qwen2.5 Coder 32B Instruct Qwen 264.6
4/35 rows
#434 #386 #39 #24

Benchmark Groups

Group Weight Benchmark Rows
Core drivers 1.5x Humanity's Last Exam 176
Core drivers 1.5x SWE-bench Pro 22
Core drivers 1.5x Creative Writing v3 9
High-signal support 1.25x ARC-AGI-2 51
High-signal support 1.25x GPQA Diamond 176
High-signal support 1.25x MATH-500 10
High-signal support 1.25x LiveCodeBench 21
High-signal support 1.25x Gert Labs Rankings 59
High-signal support 1.25x Structured Output Benchmark 19
High-signal support 1.25x Berkeley Function-Calling Leaderboard 36
High-signal support 1.25x OSWorld-Verified 15
Balanced support 1x MMLU-ProX 26
Balanced support 1x MMLU-Redux 35
Balanced support 1x LiveBench 26
Balanced support 1x MMMU-Pro 15
Balanced support 1x NL2Repo 9
Balanced support 1x EQ-Bench 8
Balanced support 1x WritingBench 12
Balanced support 1x PinchBench 63
Balanced support 1x EnterpriseOps-Gym 19
Balanced support 1x AutomationBench 5
Balanced support 1x AutoBench 32
Balanced support 1x MCPMark 33
Balanced support 1x Tau2 Airline 14
Balanced support 1x LLM-WikiRace 16
Balanced support 1x LiveSQLBench 25
Balanced support 1x Multi-IF 18
Balanced support 1x BrowseComp-zh 12
Balanced support 1x VideoMMMU 21
Breadth / tie-breakers 0.75x Arena-Hard v2 12
Breadth / tie-breakers 0.75x BigCodeBench-Hard 13
Breadth / tie-breakers 0.75x CC-OCR 14
Breadth / tie-breakers 0.75x BenchLM 74
Breadth / tie-breakers 0.75x ALL Bench LLM 27
Breadth / tie-breakers 0.75x ALE-Bench 65
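
The group weights above suggest that Avg Rank is a weighted mean of a model's per-benchmark ranks, taken only over the benchmarks where the model has a row. The page does not state the formula, so the following is a minimal sketch under that assumption. The function name `weighted_avg_rank`, the partial `BENCHMARK_GROUP` mapping, and the example ranks are hypothetical and for illustration only.

```python
from typing import Dict

# Group weights as listed in the Benchmark Groups table.
GROUP_WEIGHTS = {
    "Core drivers": 1.5,
    "High-signal support": 1.25,
    "Balanced support": 1.0,
    "Breadth / tie-breakers": 0.75,
}

# A few benchmark-to-group assignments copied from the table (not the full list).
BENCHMARK_GROUP = {
    "Humanity's Last Exam": "Core drivers",
    "SWE-bench Pro": "Core drivers",
    "GPQA Diamond": "High-signal support",
    "MMLU-Redux": "Balanced support",
    "BenchLM": "Breadth / tie-breakers",
}


def weighted_avg_rank(ranks: Dict[str, int]) -> float:
    """Weighted mean rank over the benchmarks a model has rows for.

    ASSUMPTION: the leaderboard's actual formula is not documented on the
    page; this simply weights each rank by its group's multiplier.
    """
    if not ranks:
        raise ValueError("model has no benchmark rows")
    total = 0.0
    weight_sum = 0.0
    for benchmark, rank in ranks.items():
        weight = GROUP_WEIGHTS[BENCHMARK_GROUP[benchmark]]
        total += weight * rank
        weight_sum += weight
    return total / weight_sum


# Hypothetical ranks for an imaginary model, not taken from the leaderboard.
example = {"Humanity's Last Exam": 1, "GPQA Diamond": 3, "BenchLM": 5}
print(round(weighted_avg_rank(example), 1))  # prints 2.6
```

Under this scheme a poor rank on a 0.75x tie-breaker benchmark moves the average less than the same rank on a 1.5x core driver, and models with few rows are averaged over fewer benchmarks, which is why the coverage count (for example 5/35 rows) matters when comparing Avg Rank values.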