GSO-Bench

General science and observation benchmark for frontier model capability tracking.

Rows: 10
Primary metric: score
Sampled: 2026-05-06

Metadata

Metrics

Score (primary metric, higher is better); Standard error (lower is better)

Latest Results

Rows parsed from the public leaderboard table.

| Rank | Subject | Score | Model Match | Provenance | Sampled |
|------|---------|-------|-------------|------------|---------|
| 1 | GPT-5.2 | 27.40 | GPT-5.2 (openai-gpt-5.2) | Imported | 2026-05-06 |
| 2 | Claude Opus 4.5 | 26.50 | Claude Opus 4.5 (anthropic-claude-opus-4.5) | Imported | 2026-05-06 |
| 3 | Gemini 3 Pro | 18.60 | Gemini 3 (google-gemini-3) | Imported | 2026-05-06 |
| 4 | o3 | 8.80 | o3 (openai-o3) | Imported | 2026-05-06 |
| 5 | kimi-k2-thinking (official) | 4.90 | KIMI MoonshotAI: Kimi K2 Thinking (moonshotai-kimi-k2-thinking) | Imported | 2026-05-06 |
| 6 | Qwen3-Max-Instruct | 4.90 | Qwen3 Max (qwen-qwen3-max) | Imported | 2026-05-06 |
| 7 | Claude 3.7 Sonnet | 4.60 | Claude 3.7 Sonnet (anthropic-claude-3.7-sonnet) | Imported | 2026-05-06 |
| 8 | Gemini 2.5 Pro (Jun 2025) | 3.90 | Gemini 2.5 Pro (google-gemini-2.5-pro) | Imported | 2026-05-06 |
| 9 | o4-mini (high) | 3.60 | o4 Mini High (openai-o4-mini-high) | Imported | 2026-05-06 |
| 10 | GPT-4o | 0 | GPT-4o (openai-gpt-4o) | Imported | 2026-05-06 |
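
The ranks listed above are consistent with sorting the rows by score in descending order. The following is a minimal sketch of that ordering, not the leaderboard's actual pipeline: the `Row` type and `rank_rows` function are hypothetical names, and the tie-break rule (tied scores keep their input order, as with the two 4.90 entries) is an assumption, since the page does not state one.

```python
from typing import List, NamedTuple, Tuple


class Row(NamedTuple):
    """One leaderboard entry (hypothetical structure for illustration)."""
    subject: str
    score: float
    model_id: str


# Values copied from the results listed above.
ROWS = [
    Row("GPT-5.2", 27.40, "openai-gpt-5.2"),
    Row("Claude Opus 4.5", 26.50, "anthropic-claude-opus-4.5"),
    Row("Gemini 3 Pro", 18.60, "google-gemini-3"),
    Row("o3", 8.80, "openai-o3"),
    Row("kimi-k2-thinking (official)", 4.90, "moonshotai-kimi-k2-thinking"),
    Row("Qwen3-Max-Instruct", 4.90, "qwen-qwen3-max"),
    Row("Claude 3.7 Sonnet", 4.60, "anthropic-claude-3.7-sonnet"),
    Row("Gemini 2.5 Pro (Jun 2025)", 3.90, "google-gemini-2.5-pro"),
    Row("o4-mini (high)", 3.60, "openai-o4-mini-high"),
    Row("GPT-4o", 0.0, "openai-gpt-4o"),
]


def rank_rows(rows: List[Row]) -> List[Tuple[int, Row]]:
    """Return (rank, row) pairs ordered by descending score.

    sorted() is stable even with reverse=True, so rows with equal
    scores keep their input order (assumed tie-break rule).
    """
    ordered = sorted(rows, key=lambda r: r.score, reverse=True)
    return list(enumerate(ordered, start=1))


if __name__ == "__main__":
    for rank, row in rank_rows(ROWS):
        print(f"{rank:>2}  {row.subject:<30} {row.score:>6.2f}  {row.model_id}")
```

Because the sort is stable, the two entries tied at 4.90 keep their listed order and receive ranks 5 and 6 rather than a shared rank.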