Global PIQA

Global PIQA is a multilingual commonsense reasoning benchmark that evaluates physical interaction knowledge across 100 languages and cultures. It tests how well AI systems understand the physical world in diverse cultural contexts, using multiple-choice questions about everyday situations that require physical commonsense.

Rows: 17
Primary metric: score
Sampled: 2026-05-28

Metadata

Metrics

Score, Normalized Score

Showing 2 latest source slices.

Latest Results

Provider-published Qwen3.7-Max comparison scores. Rows are marked self-reported and should be interpreted as source claims unless independently reproduced.

| Rank | Subject | Score | Model Match | Provenance | Sampled |
|------|---------|-------|-------------|------------|---------|
| 1 | Qwen3.7 Max | 91.4% | Qwen3.7 Max (qwen-qwen3.7-max) | Self-reported | 2026-05-28 |
| 2 | Claude Opus 4.6 Max | 91.2% | Claude Opus 4.6 (anthropic-claude-opus-4.6) | Self-reported | 2026-05-28 |
| 3 | DeepSeek V4 Pro Max | 90.5% | DeepSeek V4 Pro (deepseek-deepseek-v4-pro) | Self-reported | 2026-05-28 |
| 4 | Qwen3.6 Plus | 89.8% | Qwen3.6 Plus (qwen-qwen3.6-plus) | Self-reported | 2026-05-28 |
| 5 | GLM-5.1 Thinking | 89.5% | GLM 5.1 (z-ai-glm-5.1) | Self-reported | 2026-05-28 |
| 6 | Kimi K2.6 Thinking | 89.2% | MoonshotAI: Kimi K2.6 (moonshotai-kimi-k2.6) | Self-reported | 2026-05-28 |
| 1 | Gemini 3 Pro | 0.93 | Gemini 3 (google-gemini-3) | Self-reported | 2026-05-06 |
| 2 | Gemini 3 Flash | 0.93 | Gemini 3 Flash Preview (google-gemini-3-flash-preview) | Self-reported | 2026-05-06 |
| 3 | Qwen3.6 Plus | 0.90 | Qwen3.6 Plus (qwen-qwen3.6-plus) | Self-reported | 2026-05-06 |
| 3 | Qwen3.5-397B-A17B | 0.90 | Qwen3.5 397B A17B (qwen-qwen3.5-397b-a17b) | Self-reported | 2026-05-06 |
| 5 | Qwen3.5-122B-A10B | 0.88 | Qwen3.5-122B-A10B (qwen-qwen3.5-122b-a10b) | Self-reported | 2026-05-06 |
| 6 | Qwen3.5-27B | 0.88 | Qwen3.5-27B (qwen-qwen3.5-27b) | Self-reported | 2026-05-06 |
| 7 | Qwen3.5-35B-A3B | 0.87 | Qwen3.5-35B-A3B (qwen-qwen3.5-35b-a3b) | Self-reported | 2026-05-06 |
| 8 | Qwen3.5-9B | 0.83 | Qwen3.5-9B (qwen-qwen3.5-9b) | Self-reported | 2026-05-06 |
| 9 | Qwen3.5-4B | 0.79 | | Self-reported | 2026-05-06 |
| 10 | Qwen3.5-2B | 0.69 | | Self-reported | 2026-05-06 |
| 11 | Qwen3.5-0.8B | 0.59 | | Self-reported | 2026-05-06 |
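
The results above mix two score notations: the 2026-05-28 slice reports percentages (for example 91.4%), while the 2026-05-06 slice reports fractions (for example 0.93). The sketch below puts a few of the listed rows on one 0-1 scale so they can be read side by side. It assumes both notations express the same underlying metric, which this page does not confirm, and it is not meant to reproduce the page's "Normalized Score". The helper name `to_fraction` is hypothetical.

```python
def to_fraction(raw: str) -> float:
    """Convert a score string such as '91.4%' or '0.93' to a 0-1 fraction.

    Assumption: a trailing '%' marks a percentage; anything else is
    already a fraction between 0 and 1.
    """
    text = raw.strip()
    if text.endswith("%"):
        return float(text[:-1]) / 100.0
    return float(text)


# A few rows copied from the results above: (subject, score, sampled date).
rows = [
    ("Qwen3.7 Max", "91.4%", "2026-05-28"),
    ("Qwen3.6 Plus", "89.8%", "2026-05-28"),
    ("Gemini 3 Pro", "0.93", "2026-05-06"),
    ("Qwen3.6 Plus", "0.90", "2026-05-06"),
]

# Round to three decimals to hide floating-point noise from the division.
normalized = [
    (subject, round(to_fraction(score), 3), sampled)
    for subject, score, sampled in rows
]

for subject, score, sampled in normalized:
    print(f"{sampled}  {subject:<14} {score:.3f}")
```

Even on a shared scale, the two slices come from different self-reported sources with different sampling dates, so differences between them should be read with caution.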