NeedleBench

NeedleBench: Measures long-context retrieval and reasoning, that is, how well a model finds one or more "needle" facts placed in a long input and reasons over them.

Rows: 12
Primary metric: overall_128k_score
Sampled: 2026-05-27

Metadata

Metrics

Overall 128K score, Single-retrieval overall, Multi-retrieval overall, Multi-reasoning overall

Latest Results

Rows are transcribed from Table 2 of the public NeedleBench paper. The primary score is the overall score at 128K context, and rows are ranked by it from highest to lowest.

Rank  Subject                 Overall 128K score  Model Match                                        Provenance  Sampled
1     Qwen-2.5-72B            81.02%              Qwen2.5 72B Instruct (qwen-qwen-2.5-72b-instruct)  Imported    2026-05-27
2     Gemma-3-27B             80.38%              Gemma 3 27B (google-gemma-3-27b-it)                Imported    2026-05-27
3     Qwen-2.5-32B            78.25%              -                                                  Imported    2026-05-27
4     InternLM3-8B            75.49%              -                                                  Imported    2026-05-27
5     Gemma-3-12B             75.31%              Gemma 3 12B (google-gemma-3-12b-it)                Imported    2026-05-27
6     Qwen-2.5-14B            73.96%              -                                                  Imported    2026-05-27
7     LLaMA-3.1-70B           72.37%              -                                                  Imported    2026-05-27
8     LLaMA-3.1-8B            70.98%              -                                                  Imported    2026-05-27
9     Qwen-2.5-7B             70.75%              -                                                  Imported    2026-05-27
10    InternLM2.5-7B-Chat-1M  69.17%              -                                                  Imported    2026-05-27
11    GLM-4-9B-Chat           66.51%              -                                                  Imported    2026-05-27
12    Gemma-3-4B              64.42%              Gemma 3 4B (google-gemma-3-4b-it)                  Imported    2026-05-27
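
The rank column is consistent with sorting rows by the primary metric in descending order. A minimal Python sketch of that ordering follows. The (subject, score) tuple layout and the function name rank_by_primary are assumptions made for illustration, not part of the leaderboard's actual tooling; the scores are copied from the table above.

```python
# Scores copied from the table above; the (subject, score) layout is an
# assumption made for this sketch, not the leaderboard's storage format.
rows = [
    ("Gemma-3-4B", 64.42),
    ("Qwen-2.5-72B", 81.02),
    ("LLaMA-3.1-8B", 70.98),
    ("Gemma-3-27B", 80.38),
    ("InternLM3-8B", 75.49),
    ("Qwen-2.5-32B", 78.25),
    ("GLM-4-9B-Chat", 66.51),
    ("Gemma-3-12B", 75.31),
    ("Qwen-2.5-14B", 73.96),
    ("LLaMA-3.1-70B", 72.37),
    ("Qwen-2.5-7B", 70.75),
    ("InternLM2.5-7B-Chat-1M", 69.17),
]


def rank_by_primary(rows):
    """Return (rank, subject, score) tuples, highest primary score first."""
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    return [
        (rank, subject, score)
        for rank, (subject, score) in enumerate(ordered, start=1)
    ]


ranked = rank_by_primary(rows)
for rank, subject, score in ranked:
    print(f"{rank:>2}  {subject:<24}{score:.2f}%")
```

The input list is deliberately shuffled so that the sort, rather than the input order, produces the ranking.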