InfiniteBench

InfiniteBench measures long-context retrieval, needle finding, summarization, factual grounding, and retrieval-augmented generation quality.

Rows: 7
Primary metric: average_score
Sampled: 2026-05-27

Metadata

Metrics

Average score, Retrieve.PassKey, Retrieve.Number, Retrieve.KV, En.Sum, En.QA, En.MC, En.Dia, Zh.QA, Code.Debug, Code.Run, Math.Calc, Math.Find

Latest Results

Rows are parsed from the public InfiniteBench README table. The table reports mixed task metrics as percentages; the primary score is an unweighted average across the published task scores.
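As a rough illustration of the unweighted average described above, the sketch below averages whatever per-task percentages are available for one model. The function name `average_score`, the choice to skip tasks with no published value, and the sample numbers are all assumptions made for illustration; they are not taken from the InfiniteBench README or from this page's import code.

```python
from statistics import fmean
from typing import Mapping, Optional


def average_score(task_scores: Mapping[str, Optional[float]]) -> float:
    """Return the unweighted mean of the task scores that are present.

    Assumption: a task with no published score is stored as None and is
    left out of the average rather than counted as zero.
    """
    present = [score for score in task_scores.values() if score is not None]
    if not present:
        raise ValueError("no published task scores to average")
    return fmean(present)


# Hypothetical percentages for illustration only; not real benchmark results.
example = {"Retrieve.PassKey": 100.0, "En.Sum": 15.0, "Math.Find": 35.0}
print(f"{average_score(example):.6f}%")
```

Every task carries the same weight here, so a near-saturated retrieval task and a hard math task move the average equally. That is worth keeping in mind when comparing the ranks below.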

Rank  Subject            Average score  Model Match           Provenance  Sampled
1     GPT-4              46.099167%     GPT-4 (openai-gpt-4)  Imported    2026-05-27
2     Claude 2           37.843333%     -                     Imported    2026-05-27
3     Kimi-Chat          35.325%        -                     Imported    2026-05-27
4     Yi-34B-200K        27.406667%     -                     Imported    2026-05-27
5     Yi-6B-200K         24.584167%     -                     Imported    2026-05-27
6     YaRN-Mistral-7B    21.460833%     -                     Imported    2026-05-27
7     ChatGLM-3-6B-128K  19.4525%       -                     Imported    2026-05-27