Arena-Hard v2

Arena-Hard-Auto v2 is a challenging benchmark of 500 carefully curated prompts sourced from Chatbot Arena and WildChat-1M, designed to evaluate large language models on real-world user queries. The prompts cover diverse domains, including open-ended software engineering problems, mathematics, creative writing, and technical problem-solving. Evaluation is automatic and uses an LLM-as-a-Judge setup, which is reported to reach 98.6% correlation with human preference rankings while providing 3x higher separation of model performance than MT-Bench. The benchmark emphasizes prompt specificity, complexity, and domain knowledge to better distinguish between model capabilities.

16 rows

Primary metric: score

Sampled: 2026-05-06

Metadata

ID: arena_hard_v2
Category: General Knowledge
Release: Unknown

Metrics

Score, Normalized Score

Rank	Subject	Score	Model Match	Provenance	Sampled
1	MiMo-V2-Flash	0.86	MiMo-V2-Flash xiaomi-mimo-v2-flash	Self-reported	2026-05-06
2	Qwen3-Next-80B-A3B-Instruct	0.83	Qwen3 Next 80B A3B Instruct qwen-qwen3-next-80b-a3b-instruct	Self-reported	2026-05-06
3	Qwen3-235B-A22B-Thinking-2507	0.80	Qwen3 235B A22B Thinking 2507 qwen-qwen3-235b-a22b-thinking-2507	Self-reported	2026-05-06
4	Qwen3-235B-A22B-Instruct-2507	0.79	Qwen3 235B A22B Instruct 2507 qwen-qwen3-235b-a22b-2507	Self-reported	2026-05-06
5	Qwen3 VL 235B A22B Instruct	0.77	Qwen3 VL 235B A22B Instruct qwen-qwen3-vl-235b-a22b-instruct	Self-reported	2026-05-06
6	Nemotron 3 Super (120B A12B)	0.74	Nemotron 3 Super nvidia-nemotron-3-super-120b-a12b	Self-reported	2026-05-06
7	Sarvam-105B	0.71	—	Self-reported	2026-05-06
8	Nemotron 3 Nano (30B A3B)	0.68	Nemotron 3 Nano 30B A3B nvidia-nemotron-3-nano-30b-a3b	Self-reported	2026-05-06
9	Qwen3 VL 32B Instruct	0.65	Qwen3 VL 32B Instruct qwen-qwen3-vl-32b-instruct	Self-reported	2026-05-06
10	Qwen3-Next-80B-A3B-Thinking	0.62	Qwen3 Next 80B A3B Thinking qwen-qwen3-next-80b-a3b-thinking	Self-reported	2026-05-06
11	Qwen3 VL 32B Thinking	0.60	—	Self-reported	2026-05-06
12	Qwen3 VL 30B A3B Instruct	0.58	Qwen3 VL 30B A3B Instruct qwen-qwen3-vl-30b-a3b-instruct	Self-reported	2026-05-06
13	Qwen3 VL 30B A3B Thinking	0.57	Qwen3 VL 30B A3B Thinking qwen-qwen3-vl-30b-a3b-thinking	Self-reported	2026-05-06
14	Qwen3 VL 8B Thinking	0.51	Qwen3 VL 8B Thinking qwen-qwen3-vl-8b-thinking	Self-reported	2026-05-06
15	Sarvam-30B	0.49	—	Self-reported	2026-05-06
16	Qwen3 VL 4B Thinking	0.37	—	Self-reported	2026-05-06
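
For readers who want to work with the table above programmatically, the following is a minimal sketch, assuming the rows are kept as tab-separated text exactly as shown. The function name `load_rows` and the dictionary keys are illustrative choices, not part of any Arena-Hard tooling, and only three of the 16 rows are embedded to keep the example short.

```python
import csv
import io

# Header plus three rows copied verbatim from the table above (tab-separated).
LEADERBOARD_TSV = (
    "Rank\tSubject\tScore\tModel Match\tProvenance\tSampled\n"
    "1\tMiMo-V2-Flash\t0.86\tMiMo-V2-Flash xiaomi-mimo-v2-flash\tSelf-reported\t2026-05-06\n"
    "7\tSarvam-105B\t0.71\t—\tSelf-reported\t2026-05-06\n"
    "16\tQwen3 VL 4B Thinking\t0.37\t—\tSelf-reported\t2026-05-06\n"
)

# The table uses this dash to mark rows with no matched model entry.
NO_MATCH = "—"


def load_rows(tsv_text):
    """Parse the tab-separated leaderboard into a list of typed dictionaries."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = []
    for rec in reader:
        rows.append(
            {
                "rank": int(rec["Rank"]),
                "subject": rec["Subject"],
                "score": float(rec["Score"]),
                "matched": rec["Model Match"] != NO_MATCH,
                "provenance": rec["Provenance"],
                "sampled": rec["Sampled"],
            }
        )
    return rows


rows = load_rows(LEADERBOARD_TSV)
# Subjects that have no entry in the "Model Match" column.
unmatched = [r["subject"] for r in rows if not r["matched"]]
```

Every row in the table is marked Self-reported, so a filter on `provenance` is mainly useful if independently evaluated rows are added later.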
