L-Eval
L-Eval: Measures long-context retrieval, needle finding, summarization, factual grounding, or retrieval-augmented generation quality.
3rows
average_scoreprimary metric
2026-05-27sampled
Metadata
Metrics
Average score, TOEFL, QuALITY, Coursera, SFiction, GSM, CodeU
| Rank | Subject | Average score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT4-32k (2023) | 73.111667% | GPT-4 openai-gpt-4 | Imported | 2026-05-27 |
| 2 | Llama3-70b-Instruct | 68.355% | Llama 3 70B Instruct meta-llama-llama-3-70b-instruct | Imported | 2026-05-27 |
| 3 | Llama3-8b-Instruct | 58.71% | Llama 3 8B Instruct meta-llama-llama-3-8b-instruct | Imported | 2026-05-27 |
No matching rows.