PIQA

Physical Interaction Question Answering benchmark for physical commonsense reasoning.

16 rows
Primary metric: score
Sampled: 2026-05-06

Metadata

Metrics

Score (higher is better), Standard error (lower is better)
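
How a standard error relates to a score can be illustrated with a minimal sketch. It assumes the score is a plain accuracy percentage over independently graded items, so the binomial formula sqrt(p(1-p)/n) applies. The page does not say how its standard errors are computed or how many items were evaluated; the function name `binomial_standard_error` and the item count of 1000 are placeholders for illustration only.

```python
import math


def binomial_standard_error(score_pct: float, n_items: int) -> float:
    """Return the standard error, in percentage points, of an accuracy score.

    Assumes each item is graded right/wrong independently (binomial model).
    This is an illustrative assumption, not the leaderboard's documented method.
    """
    if n_items <= 0:
        raise ValueError("n_items must be positive")
    if not 0.0 <= score_pct <= 100.0:
        raise ValueError("score_pct must be between 0 and 100")
    p = score_pct / 100.0  # convert percentage to a proportion
    return 100.0 * math.sqrt(p * (1.0 - p) / n_items)


# Hypothetical item count; the real evaluation set size is not given on this page.
se = binomial_standard_error(88.70, 1000)
print(f"standard error: {se:.2f} percentage points")
```

Under these assumptions, score gaps smaller than about one or two standard errors (for example, between adjacent ranks) may not reflect a real difference between models.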

Latest Results

Rows parsed from the public leaderboard table.

Rank | Subject | Score | Model Match | Provenance | Sampled
1 | gpt-4o-mini-2024-07-18 | 88.70 | GPT-4o-mini (2024-07-18) [openai-gpt-4o-mini-2024-07-18] | Imported | 2026-05-06
2 | Phi-3-mini-4k-instruct | 88.60 | | Imported | 2026-05-06
3 | Gemini 1.5 Flash | 87.50 | | Imported | 2026-05-06
4 | Llama 3.1 405B | 85.90 | | Imported | 2026-05-06
5 | falcon-180B | 84.90 | | Imported | 2026-05-06
6 | DeepSeek V3 | 84.70 | DeepSeek V3 [deepseek-deepseek-chat] | Imported | 2026-05-06
7 | Gemma 3 27B | 83.70 | Gemma 3 27B [google-gemma-3-27b-it] | Imported | 2026-05-06
8 | Mixtral-8x7B-v0.1 | 83.60 | | Imported | 2026-05-06
9 | Mistral Large | 83.50 | Mistral Large [mistralai-mistral-large] | Imported | 2026-05-06
10 | Mistral-7B-v0.1 | 83 | | Imported | 2026-05-06
11 | Llama-2-70b-hf | 82.80 | | Imported | 2026-05-06
12 | Qwen 2.5 72B | 82.60 | Qwen2.5 72B Instruct [qwen-qwen-2.5-72b-instruct] | Imported | 2026-05-06
13 | Llama-2-7b | 81.90 | | Imported | 2026-05-06
14 | gemma-7b | 81.20 | | Imported | 2026-05-06
15 | Qwen 3 235B | 79.90 | Qwen3 235B A22B [qwen-qwen3-235b-a22b] | Imported | 2026-05-06
16 | GPT-OSS 120B | 76.70 | gpt-oss-120b [openai-gpt-oss-120b] | Imported | 2026-05-06