HellaSwag

Commonsense natural-language inference benchmark about grounded situations.

17 rows
Primary metric: score
Sampled: 2026-05-06

Metadata

Metrics

Score (higher is better), Standard error (lower is better)
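The page does not say how the standard error is computed. As a rough sketch, assuming the score is a plain accuracy percentage over independently sampled examples, a binomial standard error could be estimated as below. The function name `accuracy_standard_error` and the evaluation-set size `N_EXAMPLES` are hypothetical and do not come from the leaderboard.

```python
import math


def accuracy_standard_error(score_percent: float, n_examples: int) -> float:
    """Binomial standard error of an accuracy score, in percentage points.

    Assumes each example is an independent right/wrong trial, which may not
    match how the leaderboard derives its own standard error.
    """
    if n_examples <= 0:
        raise ValueError("n_examples must be positive")
    if not 0.0 <= score_percent <= 100.0:
        raise ValueError("score_percent must lie in [0, 100]")
    p = score_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_examples)


# Hypothetical evaluation-set size, chosen only for illustration.
N_EXAMPLES = 10_000

se = accuracy_standard_error(89.20, N_EXAMPLES)
print(f"standard error: {se:.2f} percentage points")
```

Under this assumption, two scores that differ by less than a couple of standard errors are hard to tell apart, which is presumably why the metric is marked lower is better.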

Latest Results

Rows parsed from the public leaderboard table.

Rank  Subject                     Score  Model Match                                        Provenance  Sampled
1     GPT-4.1                     95.30  GPT-4.1 (openai-gpt-4.1)                           Imported    2026-05-06
2     Llama 3.1 405B              89.20  -                                                  Imported    2026-05-06
3     falcon-180B                 89     -                                                  Imported    2026-05-06
4     DeepSeek V3                 88.90  DeepSeek V3 (deepseek-deepseek-chat)               Imported    2026-05-06
5     Mixtral-8x7B-v0.1           86.70  -                                                  Imported    2026-05-06
6     Llama-2-70b-hf              85.30  -                                                  Imported    2026-05-06
7     Qwen 2.5 72B                84.80  Qwen2.5 72B Instruct (qwen-qwen-2.5-72b-instruct)  Imported    2026-05-06
8     Qwen2.5-Max                 83     -                                                  Imported    2026-05-06
9     Phi-3-medium-128k-instruct  82.40  -                                                  Imported    2026-05-06
10    gemma-7b                    82.20  -                                                  Imported    2026-05-06
11    Mistral-7B-v0.1             81     -                                                  Imported    2026-05-06
12    Llama-2-7b                  80.70  -                                                  Imported    2026-05-06
13    Phi-3-small-8k-instruct     77     -                                                  Imported    2026-05-06
14    Phi-3-mini-4k-instruct      76.70  -                                                  Imported    2026-05-06
15    Yi-6B                       76.40  -                                                  Imported    2026-05-06
16    GPT-OSS 120B                70.50  gpt-oss-120b (openai-gpt-oss-120b)                 Imported    2026-05-06
17    Phi-4                       53.60  Phi 4 (microsoft-phi-4)                            Imported    2026-05-06