InfiniteBench
InfiniteBench: Measures long-context model quality across retrieval, summarization, question answering, dialogue, code, and math tasks.
7 rows
Primary metric: average_score
Sampled: 2026-05-27
Metadata
Metrics
Average score, Retrieve.PassKey, Retrieve.Number, Retrieve.KV, En.Sum, En.QA, En.MC, En.Dia, Zh.QA, Code.Debug, Code.Run, Math.Calc, Math.Find
| Rank | Subject | Average score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT-4 | 46.099167% | GPT-4 openai-gpt-4 | Imported | 2026-05-27 |
| 2 | Claude 2 | 37.843333% | — | Imported | 2026-05-27 |
| 3 | Kimi-Chat | 35.325% | — | Imported | 2026-05-27 |
| 4 | Yi-34B-200K | 27.406667% | — | Imported | 2026-05-27 |
| 5 | Yi-6B-200K | 24.584167% | — | Imported | 2026-05-27 |
| 6 | YaRN-Mistral-7B | 21.460833% | — | Imported | 2026-05-27 |
| 7 | ChatGLM-3-6B-128K | 19.4525% | — | Imported | 2026-05-27 |
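
As a minimal sketch of how the ranking above could be re-derived and compared, the snippet below copies the seven rows from the table and sorts them by average score. The helper names `rank_by_average` and `gap_to_leader` are hypothetical, introduced only for illustration; they are not part of any InfiniteBench tooling.

```python
# Leaderboard rows copied from the table above: (subject, average score in percent).
ROWS = [
    ("GPT-4", 46.099167),
    ("Claude 2", 37.843333),
    ("Kimi-Chat", 35.325),
    ("Yi-34B-200K", 27.406667),
    ("Yi-6B-200K", 24.584167),
    ("YaRN-Mistral-7B", 21.460833),
    ("ChatGLM-3-6B-128K", 19.4525),
]


def rank_by_average(rows):
    """Return (rank, subject, score) tuples sorted by descending average score."""
    ordered = sorted(rows, key=lambda r: r[1], reverse=True)
    return [(i, name, score) for i, (name, score) in enumerate(ordered, start=1)]


def gap_to_leader(rows):
    """Return each subject's deficit to the top score, in percentage points."""
    top = max(score for _, score in rows)
    return {name: round(top - score, 6) for name, score in rows}


if __name__ == "__main__":
    for rank, name, score in rank_by_average(ROWS):
        print(f"{rank}. {name}: {score:.2f}%")
```

Keeping the scores as plain floats (rather than formatted strings) makes it straightforward to recompute ranks or gaps if rows are added in a later sample.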