SciPredict

SciPredict benchmarks LLMs on forecasting the outcomes of real scientific experiments across biology, chemistry, and physics.

Rows: 15
Primary metric: score
Sampled: 2026-05-06

Metadata

Metrics

Score, Confidence Interval Upper, Max Score

Latest Results

| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | gemini-3-pro-preview | 25.27 | Gemini 3 (google-gemini-3) | Imported | 2026-05-06 |
| 1 | claude-opus-4-5-20251101 | 23.05 | Claude Opus 4.5 (anthropic-claude-opus-4.5) | Imported | 2026-05-06 |
| 1 | claude-opus-4-1-20250805 | 22.22 | Claude Opus 4.1 (anthropic-claude-opus-4.1) | Imported | 2026-05-06 |
| 2 | claude-sonnet-4-5-20250929 | 22.55 | Claude Sonnet 4.5 (anthropic-claude-sonnet-4.5) | Imported | 2026-05-06 |
| 2 | gemini-3-flash | 22.22 | Gemini 3 Flash Preview (google-gemini-3-flash-preview) | Imported | 2026-05-06 |
| 4 | gpt-5.2-2025-12-11 | 20.58 | GPT-5.2 (openai-gpt-5.2) | Imported | 2026-05-06 |
| 4 | o3-mini-2025-01-31 | 19.84 | o3-mini (openai-o3-mini) | Imported | 2026-05-06 |
| 6 | DeepSeek-V3 | 19.18 | DeepSeek V3 (deepseek-deepseek-chat) | Imported | 2026-05-06 |
| 7 | Llama-3.3-70b | 18.19 | Llama 3.3 70B Instruct (meta-llama-llama-3.3-70b-instruct) | Imported | 2026-05-06 |
| 7 | o3-2025-04-03 | 17.94 | | Imported | 2026-05-06 |
| 9 | gemini-2.5-pro | 17.04 | Gemini 2.5 Pro (google-gemini-2.5-pro) | Imported | 2026-05-06 |
| 9 | Qwen3-32B | 17.04 | Qwen3 32B (qwen-qwen3-32b) | Imported | 2026-05-06 |
| 9 | o4-mini-2025-04-03 | 16.21 | | Imported | 2026-05-06 |
| 10 | Qwen3-235B | 16.63 | Qwen3 235B A22B (qwen-qwen3-235b-a22b) | Imported | 2026-05-06 |
| 14 | Llama-3.1-8B | 14.65 | | Imported | 2026-05-06 |