Arena-Hard
Evaluates conversational quality, human preference, helpfulness, and pairwise response judgments.
28 rows
Primary metric: score
Sampled: 2026-05-27
Metrics: Scores, CI lower delta, CI upper delta
| Rank | Subject | Scores | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | o3-2025-04-16 | 85.9% | o3 openai-o3 | Imported | 2026-05-27 |
| 2 | o4-mini-2025-04-16-high | 79.1% | — | Imported | 2026-05-27 |
| 3 | gemini-2.5 | 79.0% | — | Imported | 2026-05-27 |
| 4 | o4-mini-2025-04-16 | 74.6% | o4 Mini openai-o4-mini | Imported | 2026-05-27 |
| 5 | gemini-2.5-flash | 68.6% | Gemini 2.5 Flash google-gemini-2.5-flash | Imported | 2026-05-27 |
| 6 | o3-mini-2025-01-31-high | 66.1% | o3 Mini High openai-o3-mini-high | Imported | 2026-05-27 |
| 7 | o1-2024-12-17-high | 61.0% | — | Imported | 2026-05-27 |
| 8 | claude-3-7-sonnet-20250219-thinking-16k | 59.8% | — | Imported | 2026-05-27 |
| 9 | Qwen3-235B-A22B | 58.4% | Qwen3 235B A22B qwen-qwen3-235b-a22b | Imported | 2026-05-27 |
| 10 | deepseek-r1 | 58.0% | R1 deepseek-r1 | Imported | 2026-05-27 |
| 11 | o1-2024-12-17 | 55.9% | o1 openai-o1 | Imported | 2026-05-27 |
| 12 | gpt-4.5-preview | 50.0% | GPT-4.5 openai-gpt-4.5-preview | Imported | 2026-05-27 |
| 13 | o3-mini-2025-01-31 | 50.0% | o3-mini openai-o3-mini | Imported | 2026-05-27 |
| 14 | gpt-4.1 | 50.0% | GPT-4.1 openai-gpt-4.1 | Imported | 2026-05-27 |
| 15 | gpt-4.1-mini | 46.9% | GPT-4.1 Mini openai-gpt-4.1-mini | Imported | 2026-05-27 |
| 16 | Qwen3-32B | 44.5% | Qwen3 32B qwen-qwen3-32b | Imported | 2026-05-27 |
| 17 | QwQ-32B | 43.5% | — | Imported | 2026-05-27 |
| 18 | Qwen3-30B-A3B | 33.9% | Qwen3 30B A3B qwen-qwen3-30b-a3b | Imported | 2026-05-27 |
| 19 | claude-3-5-sonnet-20241022 | 33.0% | Claude 3.5 Sonnet anthropic-claude-3.5-sonnet | Imported | 2026-05-27 |
| 20 | s1.1-32B | 22.3% | — | Imported | 2026-05-27 |
| 21 | llama4-maverick-instruct-basic | 17.2% | — | Imported | 2026-05-27 |
| 22 | Athene-V2-Chat | 16.4% | — | Imported | 2026-05-27 |
| 23 | gemma-3-27b-it | 15.0% | Gemma 3 27B google-gemma-3-27b-it | Imported | 2026-05-27 |
| 24 | Qwen3-4B | 15.0% | — | Imported | 2026-05-27 |
| 25 | gpt-4.1-nano | 13.7% | GPT-4.1 Nano openai-gpt-4.1-nano | Imported | 2026-05-27 |
| 26 | Llama-3.1-Nemotron-70B-Instruct-HF | 10.3% | — | Imported | 2026-05-27 |
| 27 | Qwen2.5-72B-Instruct | 10.1% | Qwen2.5 72B Instruct qwen-qwen-2.5-72b-instruct | Imported | 2026-05-27 |
| 28 | OpenThinker2-32B | 3.2% | — | Imported | 2026-05-27 |
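
The metrics for this benchmark include a CI lower delta and a CI upper delta alongside each score, although the table shows only the score. A minimal sketch of how such deltas might be turned into an interval is given here, assuming the deltas are offsets from the score in percentage points. The names `ScoreInterval`, `interval_from_deltas`, and `intervals_overlap` are hypothetical, and the sign convention of the stored deltas is not stated in the source, so the sketch takes absolute values to work with either convention.

```python
from typing import NamedTuple


class ScoreInterval(NamedTuple):
    """A score with the bounds of its confidence interval, in percent."""
    lower: float
    score: float
    upper: float


def interval_from_deltas(
    score: float, ci_lower_delta: float, ci_upper_delta: float
) -> ScoreInterval:
    """Combine a score with its CI deltas, clamped to the 0-100 range.

    Assumption: deltas are offsets from the score in percentage points.
    abs() lets this accept a lower delta stored as either -1.5 or 1.5.
    """
    lower = max(0.0, score - abs(ci_lower_delta))
    upper = min(100.0, score + abs(ci_upper_delta))
    return ScoreInterval(lower, score, upper)


def intervals_overlap(a: ScoreInterval, b: ScoreInterval) -> bool:
    """Return True when two intervals share at least one point."""
    return a.lower <= b.upper and b.lower <= a.upper


# Scores come from ranks 2 and 3 in the table; the 1.5-point deltas are
# hypothetical placeholders, not values from the leaderboard.
rank_2 = interval_from_deltas(79.1, -1.5, 1.5)
rank_3 = interval_from_deltas(79.0, -1.5, 1.5)
close_call = intervals_overlap(rank_2, rank_3)
```

With overlapping intervals, a 0.1-point gap such as the one between ranks 2 and 3 may not reflect a meaningful difference, so adjacent ranks are best read together with the CI deltas when they are available.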