SciPredict
SciPredict benchmarks LLMs on forecasting the outcomes of real scientific experiments across biology, chemistry, and physics.
Rows: 15
Primary metric: score
Sampled: 2026-05-06
Metadata
Metrics: Score, Confidence Interval Upper, Max Score
| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | gemini-3-pro-preview | 25.27 | Gemini 3 (`google-gemini-3`) | Imported | 2026-05-06 |
| 1 | claude-opus-4-5-20251101 | 23.05 | Claude Opus 4.5 (`anthropic-claude-opus-4.5`) | Imported | 2026-05-06 |
| 1 | claude-opus-4-1-20250805 | 22.22 | Claude Opus 4.1 (`anthropic-claude-opus-4.1`) | Imported | 2026-05-06 |
| 2 | claude-sonnet-4-5-20250929 | 22.55 | Claude Sonnet 4.5 (`anthropic-claude-sonnet-4.5`) | Imported | 2026-05-06 |
| 2 | gemini-3-flash | 22.22 | Gemini 3 Flash Preview (`google-gemini-3-flash-preview`) | Imported | 2026-05-06 |
| 4 | gpt-5.2-2025-12-11 | 20.58 | GPT-5.2 (`openai-gpt-5.2`) | Imported | 2026-05-06 |
| 4 | o3-mini-2025-01-31 | 19.84 | o3-mini (`openai-o3-mini`) | Imported | 2026-05-06 |
| 6 | DeepSeek-V3 | 19.18 | DeepSeek V3 (`deepseek-deepseek-chat`) | Imported | 2026-05-06 |
| 7 | Llama-3.3-70b | 18.19 | Llama 3.3 70B Instruct (`meta-llama-llama-3.3-70b-instruct`) | Imported | 2026-05-06 |
| 7 | o3-2025-04-03 | 17.94 | — | Imported | 2026-05-06 |
| 9 | gemini-2.5-pro | 17.04 | Gemini 2.5 Pro (`google-gemini-2.5-pro`) | Imported | 2026-05-06 |
| 9 | Qwen3-32B | 17.04 | Qwen3 32B (`qwen-qwen3-32b`) | Imported | 2026-05-06 |
| 9 | o4-mini-2025-04-03 | 16.21 | — | Imported | 2026-05-06 |
| 10 | Qwen3-235B | 16.63 | Qwen3 235B A22B (`qwen-qwen3-235b-a22b`) | Imported | 2026-05-06 |
| 14 | Llama-3.1-8B | 14.65 | — | Imported | 2026-05-06 |