BenchBench

BenchBench is a Benchmark Agreement Testing (BAT) leaderboard. It aggregates model scores across benchmarks and analyzes how closely benchmarks agree (correlate) with one another under a standardized BAT methodology.

137 rows
Primary metric: aggregate_score
Sampled: 2026-05-06

Metadata

Metrics

Aggregate Score

Latest Results

Rows are parsed from the public cached aggregate model-score CSV used by the BenchBench Space. Agreement details are separately available in the cached agreements CSV.
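As a rough illustration of how rows like these could be derived from such a CSV, here is a minimal Python sketch. The `aggregate_score` column name follows the primary metric named above; the `model` column name, the `rank_rows` helper, and the inline `SAMPLE` text are assumptions made for the example and are not taken from the BenchBench Space itself. The three sample rows reuse scores from the table below.

```python
import csv
import io


def rank_rows(csv_text, subject_col="model", score_col="aggregate_score"):
    """Parse CSV text and return (rank, subject, score) tuples, best score first.

    The column names are assumptions for this sketch, not the confirmed
    schema of the cached BenchBench CSV.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [(record[subject_col], float(record[score_col])) for record in reader]
    # list.sort is stable, so rows with equal scores keep their CSV order.
    rows.sort(key=lambda item: item[1], reverse=True)
    return [(index + 1, name, score) for index, (name, score) in enumerate(rows)]


# Hypothetical CSV excerpt; the scores match three rows of the table below.
SAMPLE = """model,aggregate_score
claude_3_5_sonnet_20240620,0.96
gpt_4o_2024_05_13,0.98
gpt_4o_2024_08_06,0.97
"""

if __name__ == "__main__":
    for rank, name, score in rank_rows(SAMPLE):
        print(rank, name, f"{score:.2f}")
```

Sorting on the parsed float rather than the displayed two-decimal string matters here, because many rows share the same rounded score and their relative order depends on the unrounded values in the CSV.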

Rank | Subject | Aggregate Score | Model Match | Provenance | Sampled
1 | gpt_4o_2024_05_13 | 0.98 | GPT-4o (2024-05-13) [openai-gpt-4o-2024-05-13] | Imported | 2026-05-06
2 | chatgpt_4o_latest | 0.98 | | Imported | 2026-05-06
3 | gpt_4o_2024_08_06 | 0.97 | GPT-4o (2024-08-06) [openai-gpt-4o-2024-08-06] | Imported | 2026-05-06
4 | claude_3_5_sonnet_20240620 | 0.96 | Claude 3.5 Sonnet [anthropic-claude-3.5-sonnet] | Imported | 2026-05-06
5 | gemini_1_5_pro_exp_0801 | 0.95 | | Imported | 2026-05-06
6 | llama3_1_70b_instruct | 0.93 | Llama 3.1 70B Instruct [meta-llama-llama-3.1-70b-instruct] | Imported | 2026-05-06
7 | gpt_4_turbo_2024_04_09 | 0.91 | GPT-4 Turbo [openai-gpt-4-turbo] | Imported | 2026-05-06
8 | claude_3_opus_20240229 | 0.88 | | Imported | 2026-05-06
9 | yi_large_preview | 0.87 | | Imported | 2026-05-06
10 | llama3_1_405b_instruct | 0.86 | | Imported | 2026-05-06
11 | gpt_4_0125_preview | 0.85 | GPT-4 Turbo Preview [openai-gpt-4-turbo-preview] | Imported | 2026-05-06
12 | hermes_3_llama3_1_70b | 0.85 | Hermes 3 70B Instruct [nousresearch-hermes-3-llama-3.1-70b] | Imported | 2026-05-06
13 | zephyr_orpo_141b_a35b_v0_1 | 0.84 | | Imported | 2026-05-06
14 | mistral_large_2407 | 0.84 | Mistral Large 2407 [mistralai-mistral-large-2407] | Imported | 2026-05-06
15 | gpt_4o_mini_2024_07_18 | 0.83 | GPT-4o-mini (2024-07-18) [openai-gpt-4o-mini-2024-07-18] | Imported | 2026-05-06
16 | claude_2_0 | 0.83 | | Imported | 2026-05-06
17 | smaug_qwen2_72b_instruct | 0.83 | | Imported | 2026-05-06
18 | gemini_1_5_pro_api_0514 | 0.83 | | Imported | 2026-05-06
19 | llama3_70b_instruct | 0.82 | Llama 3 70B Instruct [meta-llama-llama-3-70b-instruct] | Imported | 2026-05-06
20 | llama3_70b | 0.81 | | Imported | 2026-05-06
21 | gemma_2_9b_it_dpo | 0.81 | | Imported | 2026-05-06
22 | llama3_instruct_8b_simpo | 0.80 | | Imported | 2026-05-06
23 | yi_large | 0.79 | | Imported | 2026-05-06
24 | gemma_2_27b_it | 0.78 | Gemma 2 27B [google-gemma-2-27b-it] | Imported | 2026-05-06
25 | qwen2_72b_instruct | 0.77 | | Imported | 2026-05-06
26 | qwen1_5_32b | 0.77 | | Imported | 2026-05-06
27 | gpt_4_0613 | 0.76 | GPT-4 [openai-gpt-4] | Imported | 2026-05-06
28 | phi_3_5_moe_instruct | 0.76 | | Imported | 2026-05-06
29 | qwen1_5_110b_chat | 0.74 | | Imported | 2026-05-06
30 | mixtral_8x22b_v0_1 | 0.74 | | Imported | 2026-05-06
31 | gemma_2_9b_it_simpo | 0.73 | | Imported | 2026-05-06
32 | gemini_pro | 0.73 | | Imported | 2026-05-06
33 | llama_2_70b | 0.73 | | Imported | 2026-05-06
34 | gemini_1_5_flash_api_0514 | 0.73 | | Imported | 2026-05-06
35 | yi_34b | 0.72 | | Imported | 2026-05-06
36 | deepseek_coder_v2 | 0.71 | | Imported | 2026-05-06
37 | nous_hermes_2_mixtral_8x7b_dpo | 0.71 | | Imported | 2026-05-06
38 | gpt_3_5_turbo_0613 | 0.69 | GPT-3.5 Turbo (older v0613) [openai-gpt-3.5-turbo-0613] | Imported | 2026-05-06
39 | claude_2_1 | 0.67 | | Imported | 2026-05-06
40 | yi_1_5_34b_chat | 0.67 | | Imported | 2026-05-06
41 | mistral_medium | 0.66 | | Imported | 2026-05-06
42 | phi_3_small_128k_instruct | 0.66 | | Imported | 2026-05-06
43 | infinity_instruct_3m_0625_llama3_8b | 0.65 | | Imported | 2026-05-06
44 | claude_instant_1_2 | 0.65 | | Imported | 2026-05-06
45 | mistral_v0_1_7b | 0.62 | | Imported | 2026-05-06
46 | command_r_plus | 0.62 | | Imported | 2026-05-06
47 | phi_3_5_mini_instruct | 0.61 | | Imported | 2026-05-06
48 | llama3_1_8b_instruct | 0.61 | Llama 3.1 8B Instruct [meta-llama-llama-3.1-8b-instruct] | Imported | 2026-05-06
49 | gemma_2_9b_it | 0.60 | | Imported | 2026-05-06
50 | yi_1_5_9b_chat | 0.60 | | Imported | 2026-05-06
51 | claude_3_sonnet_20240229 | 0.60 | | Imported | 2026-05-06
52 | mixtral_8x22b_instruct_v0_1 | 0.59 | | Imported | 2026-05-06
53 | qwen1_5_14b | 0.58 | | Imported | 2026-05-06
54 | llama_65b | 0.58 | | Imported | 2026-05-06
55 | deepseek_llm_67b_chat | 0.57 | | Imported | 2026-05-06
56 | qwen1_5_32b_chat | 0.57 | | Imported | 2026-05-06
57 | wizardlm_70b | 0.56 | | Imported | 2026-05-06
58 | yi_34b_chat | 0.56 | | Imported | 2026-05-06
59 | qwen1_5_72b_chat | 0.55 | | Imported | 2026-05-06
60 | dbrx_instructruct | 0.54 | | Imported | 2026-05-06
61 | jurassic_2_jumbo_178b | 0.53 | | Imported | 2026-05-06
62 | mixtral_8x7b_v0_1 | 0.53 | | Imported | 2026-05-06
63 | openchat_3_5 | 0.53 | | Imported | 2026-05-06
64 | mistral_large_2402 | 0.51 | | Imported | 2026-05-06
65 | solar_10_7b_instruct_v1_0 | 0.50 | | Imported | 2026-05-06
66 | qwen2_7b_instruct | 0.50 | | Imported | 2026-05-06
67 | phi_3_medium_4k_instruct | 0.49 | | Imported | 2026-05-06
68 | dolphin_2_2_1_mistral_7b | 0.48 | | Imported | 2026-05-06
69 | mistral_small_2402 | 0.48 | | Imported | 2026-05-06
70 | glm_4_9b_chat | 0.48 | | Imported | 2026-05-06
71 | dbrx_instruct | 0.47 | | Imported | 2026-05-06
72 | qwen1_5_14b_chat | 0.45 | | Imported | 2026-05-06
73 | claude_3_haiku_20240307 | 0.45 | Claude 3 Haiku [anthropic-claude-3-haiku] | Imported | 2026-05-06
74 | gemma_7b | 0.45 | | Imported | 2026-05-06
75 | llama3_8b_instruct | 0.44 | Llama 3 8B Instruct [meta-llama-llama-3-8b-instruct] | Imported | 2026-05-06
76 | llama3_8b | 0.44 | | Imported | 2026-05-06
77 | wizardlm_13b | 0.43 | | Imported | 2026-05-06
78 | starling_lm_7b_alpha | 0.43 | | Imported | 2026-05-06
79 | jurassic_2_grande_17b | 0.42 | | Imported | 2026-05-06
80 | mistral_7b_v0_3 | 0.42 | | Imported | 2026-05-06
81 | llama_2_13b | 0.41 | | Imported | 2026-05-06
82 | llama_2_70b_chat | 0.41 | | Imported | 2026-05-06
83 | phi_3_mini_4k_instruct | 0.40 | | Imported | 2026-05-06
84 | openhermes_2_5_mistral_7b | 0.40 | | Imported | 2026-05-06
85 | llama_2_13b_chat | 0.39 | | Imported | 2026-05-06
86 | guanaco_33b | 0.38 | | Imported | 2026-05-06
87 | phi_3_mini_128k_instruct | 0.38 | | Imported | 2026-05-06
88 | mistral_7b_v0_2 | 0.38 | | Imported | 2026-05-06
89 | internlm2_chat_20b | 0.37 | | Imported | 2026-05-06
90 | starling_lm_7b_beta | 0.36 | | Imported | 2026-05-06
91 | gpt_3_5_turbo_0125 | 0.36 | GPT-3.5 Turbo [openai-gpt-3.5-turbo] | Imported | 2026-05-06
92 | tulu_2_dpo_70b | 0.36 | | Imported | 2026-05-06
93 | qwen1_5_7b | 0.35 | | Imported | 2026-05-06
94 | falcon_40b | 0.35 | | Imported | 2026-05-06
95 | yi_1_5_6b_chat | 0.34 | | Imported | 2026-05-06
96 | zephyr_7b_alpha | 0.34 | | Imported | 2026-05-06
97 | command_r | 0.33 | Command R (08-2024) [cohere-command-r-08-2024] | Imported | 2026-05-06
98 | luminous_supreme_70b | 0.33 | | Imported | 2026-05-06
99 | yi_6b | 0.30 | | Imported | 2026-05-06
100 | zephyr_7b_beta | 0.29 | | Imported | 2026-05-06
101 | mixtral_8x7b_instruct_v0_1 | 0.28 | Mistral: Mixtral 8x7B Instruct [mistralai-mixtral-8x7b-instruct] | Imported | 2026-05-06
102 | qwen_14b_chat | 0.28 | | Imported | 2026-05-06
103 | gemma_2_2b_it | 0.28 | | Imported | 2026-05-06
104 | phi_3_small_8k_instruct | 0.27 | | Imported | 2026-05-06
105 | gemma_1_1_7b_it | 0.26 | | Imported | 2026-05-06
106 | llama_2_7b | 0.25 | | Imported | 2026-05-06
107 | mistral_7b_instruct_v0_2 | 0.25 | | Imported | 2026-05-06
108 | mistral_7b_instruct_v0_3 | 0.25 | | Imported | 2026-05-06
109 | qwen1_5_7b_chat | 0.24 | | Imported | 2026-05-06
110 | alpaca_7b | 0.23 | | Imported | 2026-05-06
111 | luminous_extended_30b | 0.23 | | Imported | 2026-05-06
112 | llama_13b | 0.22 | | Imported | 2026-05-06
113 | phi_2 | 0.20 | | Imported | 2026-05-06
114 | qwen2_1_5b_instruct | 0.20 | | Imported | 2026-05-06
115 | yi_6b_chat | 0.19 | | Imported | 2026-05-06
116 | vicuna_7b | 0.19 | | Imported | 2026-05-06
117 | gemma_7b_it | 0.19 | | Imported | 2026-05-06
118 | olmo_7b_instruct | 0.16 | | Imported | 2026-05-06
119 | vicuna_7b_v1_5 | 0.15 | | Imported | 2026-05-06
120 | vicuna_13b | 0.15 | | Imported | 2026-05-06
121 | gpt_neox_20b | 0.14 | | Imported | 2026-05-06
122 | falcon_40b_instruct | 0.13 | | Imported | 2026-05-06
123 | qwen1_5_4b_chat | 0.13 | | Imported | 2026-05-06
124 | falcon_7b | 0.11 | | Imported | 2026-05-06
125 | llama_2_7b_chat | 0.11 | | Imported | 2026-05-06
126 | gpt_j_6b | 0.10 | | Imported | 2026-05-06
127 | luminous_base_13b | 0.08 | | Imported | 2026-05-06
128 | gemma_2b_it | 0.08 | | Imported | 2026-05-06
129 | gemma_1_1_2b_it | 0.07 | | Imported | 2026-05-06
130 | olmo_7b | 0.06 | | Imported | 2026-05-06
131 | qwen1_5_1_8b_chat | 0.06 | | Imported | 2026-05-06
132 | qwen2_0_5b_instruct | 0.06 | | Imported | 2026-05-06
133 | pythia_12b | 0.05 | | Imported | 2026-05-06
134 | chatglm2_6b | 0.03 | | Imported | 2026-05-06
135 | pythia_6_9b | 0.02 | | Imported | 2026-05-06
136 | qwen1_5_0_5b_chat | 0.01 | | Imported | 2026-05-06
137 | falcon_7b_instruct | 0.01 | | Imported | 2026-05-06
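
The benchmark agreement this leaderboard reports is about whether two benchmarks order the same models similarly. The Python sketch below shows one common way to quantify that, Kendall's tau-a rank correlation over the models two benchmarks share. The choice of metric is an assumption for illustration and may differ from what the BAT methodology actually uses; the `kendall_tau` helper, the benchmark dictionaries, and all model names and scores in it are hypothetical.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall tau-a between two benchmarks, over the models they share.

    Returns a value in [-1, 1]: 1 means both benchmarks order every pair
    of models the same way, -1 means every pair is ordered oppositely.
    Tied pairs count as neither concordant nor discordant.
    """
    shared = sorted(set(scores_a) & set(scores_b))
    if len(shared) < 2:
        raise ValueError("need at least two shared models")
    concordant = 0
    discordant = 0
    for first, second in combinations(shared, 2):
        diff_a = scores_a[first] - scores_a[second]
        diff_b = scores_b[first] - scores_b[second]
        product = diff_a * diff_b
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    total_pairs = len(shared) * (len(shared) - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical scores for four placeholder models on two placeholder benchmarks.
BENCH_X = {"model_a": 0.9, "model_b": 0.7, "model_c": 0.5, "model_d": 0.3}
BENCH_Y = {"model_a": 0.8, "model_b": 0.85, "model_c": 0.4, "model_d": 0.2}

if __name__ == "__main__":
    print(round(kendall_tau(BENCH_X, BENCH_Y), 3))
```

In this toy data the two benchmarks disagree only on the order of `model_a` and `model_b`, so five of the six model pairs are concordant and one is discordant.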