needle-1M-bench

Centrally scored long-context needle-retrieval benchmark on dense scientific paper text, with haystacks from 50K through 1M tokens.

Rows: 11
Primary metric: overall_recall
Sampled: 2026-05-06

Metadata

Metrics

Overall Recall, Paper-Anchored Recall, Synthetic Codes Recall, Haystack Tokens, Max Output Tokens, Depth Points

Latest Results

Rows are parsed from centrally scored public .eval_results YAML files. Recall values are converted to percentages; per-depth recall maps are preserved in metadata.
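The conversion described above can be sketched roughly as follows. This is a minimal illustration, not the leaderboard's actual importer: the field names `overall_recall` and `per_depth_recall`, the helper names `to_row` and `rank_rows`, and the assumption that recall is stored as a fraction between 0 and 1 are all hypothetical. The input is a plain dict of the kind a YAML parser would return for one result file.

```python
from typing import Any


def to_row(subject: str, result: dict[str, Any]) -> dict[str, Any]:
    """Convert one parsed result mapping into a leaderboard row.

    Assumes (hypothetically) that `overall_recall` is a fraction in [0, 1]
    and that an optional `per_depth_recall` map may be present.
    """
    row: dict[str, Any] = {
        "subject": subject,
        # Fraction -> percentage; rounding hides float noise such as 93.999...
        "overall_recall": round(result["overall_recall"] * 100, 2),
        "metadata": {},
    }
    # The per-depth map is copied through unchanged rather than converted.
    if "per_depth_recall" in result:
        row["metadata"]["per_depth_recall"] = dict(result["per_depth_recall"])
    return row


def rank_rows(rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Order rows by overall recall, highest first, and attach a 1-based rank.

    Python's sort is stable, so tied rows keep their input order.
    """
    ordered = sorted(rows, key=lambda r: r["overall_recall"], reverse=True)
    return [{"rank": i, **r} for i, r in enumerate(ordered, start=1)]


if __name__ == "__main__":
    # Made-up subjects and scores, for illustration only.
    rows = [
        to_row("model-a", {"overall_recall": 0.7}),
        to_row("model-b", {"overall_recall": 0.94,
                           "per_depth_recall": {"0.5": 0.9}}),
    ]
    for r in rank_rows(rows):
        print(r["rank"], r["subject"], r["overall_recall"])
```

How ties are actually ordered on this page is not stated in the source; the stable sort here is only one plausible choice.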

Rank  Subject                              Overall Recall (%)  Model Match                                 Provenance  Sampled
1     deepseek-v4-pro                      100                 DeepSeek V4 Pro (deepseek-deepseek-v4-pro)  Imported    2026-05-06
2     deepseek-v4-pro                      100                 DeepSeek V4 Pro (deepseek-deepseek-v4-pro)  Imported    2026-05-06
3     gemini-2.5-pro                       100                 Gemini 2.5 Pro (google-gemini-2.5-pro)      Imported    2026-05-06
4     qwen2.5-coder-14b-instruct-awq-int4  100                 -                                           Imported    2026-05-06
5     qwen3-32b-awq-int4                   100                 -                                           Imported    2026-05-06
6     deepseek-v4-pro                      100                 DeepSeek V4 Pro (deepseek-deepseek-v4-pro)  Imported    2026-05-06
7     deepseek-v4-pro                      94                  DeepSeek V4 Pro (deepseek-deepseek-v4-pro)  Imported    2026-05-06
8     nemotron-3-nano-omni-30b-a3b-w4a16   90                  -                                           Imported    2026-05-06
9     qwen3-14b-awq-int4                   90                  -                                           Imported    2026-05-06
10    qwen3-8b-awq-int4                    80                  -                                           Imported    2026-05-06
11    qwen3-4b-awq-int4                    70                  -                                           Imported    2026-05-06