WinoGrande

Large-scale Winograd schema challenge for commonsense pronoun-resolution reasoning.

21 rows
Primary metric: score
Sampled: 2026-05-06

Metadata

Metrics

Score (higher is better), Standard error (lower is better)
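The page does not say how the standard error is computed. As a rough illustration only, the sketch below assumes the score is an accuracy percentage over independent test items and uses the usual binomial formula sqrt(p(1 - p) / n). The function name `accuracy_standard_error` and the item count `n_items=1000` are hypothetical and are not taken from the leaderboard.

```python
import math


def accuracy_standard_error(score_percent: float, n_items: int) -> float:
    """Binomial standard error of an accuracy score, in percentage points.

    Assumes each item is scored right/wrong independently; this is an
    illustrative assumption, not the leaderboard's documented method.
    """
    if n_items <= 0:
        raise ValueError("n_items must be positive")
    if not 0.0 <= score_percent <= 100.0:
        raise ValueError("score_percent must be between 0 and 100")
    p = score_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_items)


# Hypothetical item count, chosen only to make the arithmetic concrete.
se = accuracy_standard_error(89.20, n_items=1000)
print(f"standard error: {se:.2f} percentage points")
```

Under these assumptions, two scores that differ by less than a couple of standard errors should not be read as a meaningful difference in rank.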

Latest Results

Rows parsed from the public leaderboard table.

Rank | Subject | Score | Model Match | Provenance | Sampled
1 | Llama 3.1 405B | 89.20 | | Imported | 2026-05-06
2 | Claude 3 Opus | 88.50 | | Imported | 2026-05-06
3 | GPT-4.1 | 87.50 | GPT-4.1 (openai-gpt-4.1) | Imported | 2026-05-06
4 | falcon-180B | 87.10 | | Imported | 2026-05-06
5 | DeepSeek V3 | 86.30 | DeepSeek V3 (deepseek-deepseek-chat) | Imported | 2026-05-06
6 | Meta-Llama-3-8B-Instruct | 83.50 | Llama 3 8B Instruct (meta-llama-llama-3-8b-instruct) | Imported | 2026-05-06
7 | Qwen 2.5 72B | 82.30 | Qwen2.5 72B Instruct (qwen-qwen-2.5-72b-instruct) | Imported | 2026-05-06
8 | gpt-3.5-turbo-1106 | 81.60 | GPT-3.5 Turbo (openai-gpt-3.5-turbo) | Imported | 2026-05-06
9 | Phi-3-medium-128k-instruct | 81.50 | | Imported | 2026-05-06
10 | Phi-3-small-8k-instruct | 81.50 | | Imported | 2026-05-06
11 | Qwen2.5-Max | 80.80 | | Imported | 2026-05-06
12 | Llama-2-70b-hf | 80.20 | | Imported | 2026-05-06
13 | gemma-7b | 79 | | Imported | 2026-05-06
14 | Mixtral-8x7B-v0.1 | 77.20 | | Imported | 2026-05-06
15 | Llama-2-7b | 76.70 | | Imported | 2026-05-06
16 | Mistral-7B-v0.1 | 75.30 | | Imported | 2026-05-06
17 | Claude 3.7 Sonnet | 75.10 | Claude 3.7 Sonnet (anthropic-claude-3.7-sonnet) | Imported | 2026-05-06
18 | Phi-4 | 73.40 | Phi 4 (microsoft-phi-4) | Imported | 2026-05-06
19 | Yi-6B | 73 | | Imported | 2026-05-06
20 | Phi-3-mini-4k-instruct | 70.80 | | Imported | 2026-05-06
21 | GPT-OSS 120B | 66.10 | gpt-oss-120b (openai-gpt-oss-120b) | Imported | 2026-05-06