LongBench v2
Measures long-context retrieval, needle finding, summarization, factual grounding, or retrieval-augmented generation quality.
Rows: 38
Primary metric: overall_cot_accuracy
Sampled: 2026-05-27
Metadata
Metrics: Overall w/ CoT, Overall, Easy w/ CoT, Easy, Hard w/ CoT, Hard, Short w/ CoT, Short, Medium w/ CoT, Medium, Long w/ CoT, Long
| Rank | Subject | Overall w/ CoT | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Gemini-2.5-Pro | 63.3% | Gemini 2.5 Pro (`google-gemini-2.5-pro`) | Imported | 2026-05-27 |
| 2 | Gemini-2.5-Flash | 62.1% | Gemini 2.5 Flash (`google-gemini-2.5-flash`) | Imported | 2026-05-27 |
| 3 | Qwen3-235B-A22B-Thinking-2507 (Alibaba) | 60.6% | — | Imported | 2026-05-27 |
| 4 | DeepSeek-R1 | 58.3% | R1 (`deepseek-r1`) | Imported | 2026-05-27 |
| 5 | Qwen3-235B-A22B-Instruct-2507 (Alibaba) | 58.3% | — | Imported | 2026-05-27 |
| 6 | o1-preview | 57.7% | o1-preview (`openai-o1-preview`) | Imported | 2026-05-27 |
| 7 | DeepSeek-R1-0528 | 56.7% | R1 0528 (`deepseek-deepseek-r1-0528`) | Imported | 2026-05-27 |
| 8 | MiniMax-Text-01 | 56.5% | — | Imported | 2026-05-27 |
| 9 | Gemini-2.0-Flash-Thinking | 56.0% | — | Imported | 2026-05-27 |
| 10 | Human | 53.7% | — | Imported | 2026-05-27 |
| 11 | Gemini-Exp-1206 | 52.5% | — | Imported | 2026-05-27 |
| 12 | GPT-4o | 51.4% | GPT-4o (`openai-gpt-4o`) | Imported | 2026-05-27 |
| 13 | GPT-4o | 51.2% | GPT-4o (`openai-gpt-4o`) | Imported | 2026-05-27 |
| 14 | Gemini-2.0-Flash | 51.1% | Gemini 2.0 Flash (`google-gemini-2.0-flash`) | Imported | 2026-05-27 |
| 15 | GLM-4.5 (Z.ai & Tsinghua) | 50.3% | — | Imported | 2026-05-27 |
| 16 | Qwen3-235B-A22B (Alibaba) | 50.1% | — | Imported | 2026-05-27 |
| 17 | Qwen3-30B-A3B-Thinking-2507 (Alibaba) | 50.1% | — | Imported | 2026-05-27 |
| 18 | Qwen3-32B (Alibaba) | 49.2% | — | Imported | 2026-05-27 |
| 19 | QwQ-32B (Alibaba) | 48.9% | — | Imported | 2026-05-27 |
| 20 | GLM-4.5-Air (Z.ai & Tsinghua) | 48.6% | — | Imported | 2026-05-27 |
| 21 | Claude 3.5 Sonnet (Anthropic) | 46.7% | — | Imported | 2026-05-27 |
| 22 | GLM-4-Plus (Z.ai & Tsinghua) | 46.1% | — | Imported | 2026-05-27 |
| 23 | Kimi-K2-Instruct (Moonshot AI) | 44.3% | — | Imported | 2026-05-27 |
| 24 | Qwen2.5-72B (Alibaba) | 43.5% | — | Imported | 2026-05-27 |
| 25 | Qwen3-30B-A3B (Alibaba) | 42.5% | — | Imported | 2026-05-27 |
| 26 | Mistral Large 24.11 (Mistral AI) | 39.6% | — | Imported | 2026-05-27 |
| 27 | o1-mini (OpenAI) | 38.9% | — | Imported | 2026-05-27 |
| 28 | Llama 3.1 70B (Meta) | 36.2% | — | Imported | 2026-05-27 |
| 29 | Llama 3.3 70B (Meta) | 36.2% | — | Imported | 2026-05-27 |
| 30 | Qwen2.5-7B (Alibaba) | 35.6% | — | Imported | 2026-05-27 |
| 31 | Nemotron 70B (Nvidia) | 35.2% | — | Imported | 2026-05-27 |
| 32 | Mistral Large 2 (Mistral AI) | 33.6% | — | Imported | 2026-05-27 |
| 33 | GPT-4o mini (OpenAI) | 32.4% | GPT-4o-mini (`openai-gpt-4o-mini`) | Imported | 2026-05-27 |
| 34 | NExtLong 8B (CAS) | 32.0% | — | Imported | 2026-05-27 |
| 35 | Command R+ (Cohere) | 31.6% | — | Imported | 2026-05-27 |
| 36 | GLM-4-9B (Z.ai & Tsinghua) | 30.8% | — | Imported | 2026-05-27 |
| 37 | Llama 3.1 8B (Meta) | 30.4% | — | Imported | 2026-05-27 |
| 38 | Random | 25.0% | — | Imported | 2026-05-27 |