QuantSightBench
Numerical forecasting benchmark evaluating whether LLMs produce calibrated 90% prediction intervals for 1,000 real-world questions under zero-shot, grounded, and agentic retrieval settings.
11 rows
Primary metric: coverage
Sampled: 2026-05-28
Metadata
Metrics
Coverage, Mean Log IS (lower is better)
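
As a rough illustration of the two metrics, the sketch below scores a handful of made-up 90% intervals. It assumes that coverage is the fraction of questions whose true value falls inside the stated interval, and that "Mean Log IS" is the mean natural log of the standard interval (Winkler) score. The benchmark's exact definitions are not given here, so the function names, the `ALPHA` constant, and the sample data are all hypothetical.

```python
import math

ALPHA = 0.10  # assumption: a 90% interval leaves 10% of probability outside


def covered(lower, upper, truth):
    """True when the realized value lies inside the closed interval."""
    return lower <= truth <= upper


def interval_score(lower, upper, truth, alpha=ALPHA):
    """Standard interval score: width plus a scaled penalty for misses."""
    width = upper - lower
    below = max(lower - truth, 0.0)  # distance by which truth undershoots
    above = max(truth - upper, 0.0)  # distance by which truth overshoots
    return width + (2.0 / alpha) * (below + above)


def summarize(forecasts):
    """Return (coverage, mean log interval score) for (lower, upper, truth) triples.

    Assumes every interval has positive width, so the log is defined.
    """
    n = len(forecasts)
    coverage = sum(covered(l, u, y) for l, u, y in forecasts) / n
    mean_log_is = sum(math.log(interval_score(l, u, y)) for l, u, y in forecasts) / n
    return coverage, mean_log_is


# Hypothetical forecasts, not benchmark data: (lower, upper, truth)
forecasts = [
    (10.0, 20.0, 15.0),    # hit
    (10.0, 20.0, 25.0),    # miss above
    (0.0, 4.0, 1.0),       # hit
    (100.0, 200.0, 90.0),  # miss below
]

coverage, mean_log_is = summarize(forecasts)
print(f"coverage={coverage:.4f}  mean_log_is={mean_log_is:.4f}")
```

Under these assumptions, a perfectly calibrated forecaster would land near 0.90 coverage. Every coverage value in the table below is under that mark, which would indicate intervals that are too narrow on average.
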
| Rank | Subject | Coverage | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Gemini 3.1 Pro | 0.7910 | Gemini 3.1 Pro Preview (`google-gemini-3.1-pro-preview`) | Imported | 2026-05-28 |
| 2 | Grok 4 | 0.7638 | Grok 4 (`x-ai-grok-4`) | Imported | 2026-05-28 |
| 3 | GPT-5.4 | 0.7533 | GPT-5.4 (`openai-gpt-5.4`) | Imported | 2026-05-28 |
| 4 | GPT-5.1 | 0.7459 | GPT-5.1 (`openai-gpt-5.1`) | Imported | 2026-05-28 |
| 5 | Opus 4.6 | 0.7360 | — | Imported | 2026-05-28 |
| 6 | Opus 4.5 | 0.7166 | — | Imported | 2026-05-28 |
| 7 | Sonnet 4.5 | 0.6796 | — | Imported | 2026-05-28 |
| 8 | Kimi K2 Thinking | 0.6579 | MoonshotAI: Kimi K2 Thinking (`moonshotai-kimi-k2-thinking`) | Imported | 2026-05-28 |
| 9 | Gemini 3 Pro | 0.6543 | Gemini 3 (`google-gemini-3`) | Imported | 2026-05-28 |
| 10 | GLM-4.7 | 0.6269 | GLM 4.7 (`z-ai-glm-4.7`) | Imported | 2026-05-28 |
| 11 | DeepSeek v3.2 | 0.6148 | DeepSeek V3.2 (`deepseek-deepseek-v3.2`) | Imported | 2026-05-28 |