Arena-Hard

Evaluates conversational quality, helpfulness, and alignment with human preference, scored through pairwise response judgments.

Rows: 28
Primary metric: score
Sampled: 2026-05-27

Metadata

Metrics: Scores, CI lower delta, CI upper delta

Latest Results

Rows are parsed from the official Arena-Hard README console table for Arena-Hard-v2.0-Preview with style control and Gemini-2.5 as judge.

Rank  Subject                                  Score  Model Match                                        Provenance  Sampled
1     o3-2025-04-16                            85.9%  o3 (openai-o3)                                     Imported    2026-05-27
2     o4-mini-2025-04-16-high                  79.1%  -                                                  Imported    2026-05-27
3     gemini-2.5                               79.0%  -                                                  Imported    2026-05-27
4     o4-mini-2025-04-16                       74.6%  o4 Mini (openai-o4-mini)                           Imported    2026-05-27
5     gemini-2.5-flash                         68.6%  Gemini 2.5 Flash (google-gemini-2.5-flash)         Imported    2026-05-27
6     o3-mini-2025-01-31-high                  66.1%  o3 Mini High (openai-o3-mini-high)                 Imported    2026-05-27
7     o1-2024-12-17-high                       61.0%  -                                                  Imported    2026-05-27
8     claude-3-7-sonnet-20250219-thinking-16k  59.8%  -                                                  Imported    2026-05-27
9     Qwen3-235B-A22B                          58.4%  Qwen3 235B A22B (qwen-qwen3-235b-a22b)             Imported    2026-05-27
10    deepseek-r1                              58.0%  R1 (deepseek-r1)                                   Imported    2026-05-27
11    o1-2024-12-17                            55.9%  o1 (openai-o1)                                     Imported    2026-05-27
12    gpt-4.5-preview                          50.0%  GPT-4.5 (openai-gpt-4.5-preview)                   Imported    2026-05-27
13    o3-mini-2025-01-31                       50.0%  o3-mini (openai-o3-mini)                           Imported    2026-05-27
14    gpt-4.1                                  50.0%  GPT-4.1 (openai-gpt-4.1)                           Imported    2026-05-27
15    gpt-4.1-mini                             46.9%  GPT-4.1 Mini (openai-gpt-4.1-mini)                 Imported    2026-05-27
16    Qwen3-32B                                44.5%  Qwen3 32B (qwen-qwen3-32b)                         Imported    2026-05-27
17    QwQ-32B                                  43.5%  -                                                  Imported    2026-05-27
18    Qwen3-30B-A3B                            33.9%  Qwen3 30B A3B (qwen-qwen3-30b-a3b)                 Imported    2026-05-27
19    claude-3-5-sonnet-20241022               33.0%  Claude 3.5 Sonnet (anthropic-claude-3.5-sonnet)    Imported    2026-05-27
20    s1.1-32B                                 22.3%  -                                                  Imported    2026-05-27
21    llama4-maverick-instruct-basic           17.2%  -                                                  Imported    2026-05-27
22    Athene-V2-Chat                           16.4%  -                                                  Imported    2026-05-27
23    gemma-3-27b-it                           15.0%  Gemma 3 27B (google-gemma-3-27b-it)                Imported    2026-05-27
24    Qwen3-4B                                 15.0%  -                                                  Imported    2026-05-27
25    gpt-4.1-nano                             13.7%  GPT-4.1 Nano (openai-gpt-4.1-nano)                 Imported    2026-05-27
26    Llama-3.1-Nemotron-70B-Instruct-HF       10.3%  -                                                  Imported    2026-05-27
27    Qwen2.5-72B-Instruct                     10.1%  Qwen2.5 72B Instruct (qwen-qwen-2.5-72b-instruct)  Imported    2026-05-27
28    OpenThinker2-32B                         3.2%   -                                                  Imported    2026-05-27
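
The rows above are described as parsed from a console table, with three metrics per row: score, CI lower delta, and CI upper delta. The following is a minimal Python sketch of how such a row might be parsed and how the deltas could be turned into interval bounds. The row layout (`<index>  <model>  <score>  (-<lower> / +<upper>)`), the zero-based index, and the names `Row`, `parse_row`, and `ci_bounds` are all assumptions made for illustration; this is not the importer actually used for this page, and the sample line uses a placeholder model with made-up numbers.

```python
import re
from typing import NamedTuple, Optional, Tuple


class Row(NamedTuple):
    rank: int              # 1-based rank shown on the leaderboard
    model: str             # subject identifier, e.g. a model name
    score: float           # primary metric, in percent
    ci_lower_delta: float  # distance from score down to the lower CI bound
    ci_upper_delta: float  # distance from score up to the upper CI bound


# Assumed row shape: "<index>  <model>  <score>  (-<lower> / +<upper>)"
ROW_RE = re.compile(
    r"^\s*(?P<idx>\d+)\s+(?P<model>\S+)\s+(?P<score>\d+(?:\.\d+)?)\s+"
    r"\(\s*-(?P<lo>\d+(?:\.\d+)?)\s*/\s*\+(?P<hi>\d+(?:\.\d+)?)\s*\)\s*$"
)


def parse_row(line: str) -> Optional[Row]:
    """Parse one table line; return None for headers or unrecognized lines."""
    m = ROW_RE.match(line)
    if m is None:
        return None
    return Row(
        rank=int(m["idx"]) + 1,  # assumes the console index starts at 0
        model=m["model"],
        score=float(m["score"]),
        ci_lower_delta=float(m["lo"]),
        ci_upper_delta=float(m["hi"]),
    )


def ci_bounds(row: Row) -> Tuple[float, float]:
    """Convert the two deltas into absolute confidence-interval bounds."""
    return (row.score - row.ci_lower_delta, row.score + row.ci_upper_delta)


# Hypothetical input line, not taken from the real table.
sample = parse_row("0  example-model  50.0  (-1.5 / +2.0)")
```

Storing the deltas rather than the absolute bounds keeps each row close to the source text; the bounds can be derived on demand with `ci_bounds`.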