QuantSightBench

Numerical forecasting benchmark evaluating whether LLMs produce calibrated 90% prediction intervals for 1,000 real-world questions under zero-shot, grounded, and agentic retrieval settings.

Rows: 11
Primary metric: coverage
Sampled: 2026-05-28

Metadata

Metrics

Coverage (primary metric); Mean Log IS (lower is better)
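As a rough illustration of these metrics, the sketch below computes empirical coverage for a handful of made-up 90% intervals, along with the standard interval (Winkler) score. It is not the benchmark's scoring code: the `Forecast` type, the function names, and the sample numbers are all hypothetical. "IS" is assumed here to mean interval score, and since this page does not define the log transform behind "Mean Log IS", the sketch stops at the raw score.

```python
from dataclasses import dataclass


@dataclass
class Forecast:
    """One hypothetical question: a predicted interval and the resolved value."""
    lower: float
    upper: float
    actual: float


def coverage(forecasts):
    """Fraction of resolved values that fall inside their interval (bounds inclusive)."""
    hits = sum(1 for f in forecasts if f.lower <= f.actual <= f.upper)
    return hits / len(forecasts)


def interval_score(f, alpha=0.10):
    """Standard interval (Winkler) score for a central (1 - alpha) interval.

    Width of the interval plus a penalty of 2/alpha times the distance
    by which the resolved value misses the interval. Lower is better.
    """
    width = f.upper - f.lower
    below = max(f.lower - f.actual, 0.0)  # miss below the lower bound
    above = max(f.actual - f.upper, 0.0)  # miss above the upper bound
    return width + (2.0 / alpha) * (below + above)


# Made-up examples, not benchmark data.
forecasts = [
    Forecast(10.0, 20.0, 15.0),     # inside the interval
    Forecast(10.0, 20.0, 25.0),     # misses above by 5
    Forecast(0.0, 4.0, 4.0),        # on the boundary, counted as covered
    Forecast(100.0, 200.0, 90.0),   # misses below by 10
]

cov = coverage(forecasts)
scores = [interval_score(f) for f in forecasts]
mean_score = sum(scores) / len(scores)

print(f"coverage = {cov:.4f}")
print(f"mean interval score = {mean_score:.2f}")
```

A well-calibrated forecaster issuing 90% intervals would be expected to reach a coverage near 0.90. A lower value means the intervals were too narrow or misplaced more often than the nominal level allows.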

Latest Results

Rows are imported from the official QuantSightBench static data.js leaderboard. The main leaderboard reports agentic retrieval at high reasoning effort.

Rank  Subject           Coverage  Model Match                                                 Provenance  Sampled
1     Gemini 3.1 Pro    0.7910    Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview)      Imported    2026-05-28
2     Grok 4            0.7638    Grok 4 (x-ai-grok-4)                                        Imported    2026-05-28
3     GPT-5.4           0.7533    GPT-5.4 (openai-gpt-5.4)                                    Imported    2026-05-28
4     GPT-5.1           0.7459    GPT-5.1 (openai-gpt-5.1)                                    Imported    2026-05-28
5     Opus 4.6          0.7360    -                                                           Imported    2026-05-28
6     Opus 4.5          0.7166    -                                                           Imported    2026-05-28
7     Sonnet 4.5        0.6796    -                                                           Imported    2026-05-28
8     Kimi K2 Thinking  0.6579    MoonshotAI: Kimi K2 Thinking (moonshotai-kimi-k2-thinking)  Imported    2026-05-28
9     Gemini 3 Pro      0.6543    Gemini 3 (google-gemini-3)                                  Imported    2026-05-28
10    GLM-4.7           0.6269    GLM 4.7 (z-ai-glm-4.7)                                      Imported    2026-05-28
11    DeepSeek v3.2     0.6148    DeepSeek V3.2 (deepseek-deepseek-v3.2)                      Imported    2026-05-28