L-Eval

L-Eval: Measures long-context retrieval, needle finding, summarization, factual grounding, or retrieval-augmented generation quality.

Rows: 3
Primary metric: average_score
Sampled: 2026-05-27

Metadata

Metrics

Average score, TOEFL, QuALITY, Coursera, SFiction, GSM, CodeU

Latest Results

Rows are parsed from the public L-Eval README closed-ended task table. The primary score is an unweighted average across the published task scores.
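As a rough illustration of the unweighted average described above, the sketch below takes the six task columns listed under Metrics and averages them with equal weight. The function name `average_score` and the example values are hypothetical placeholders, not figures from the L-Eval README, and the site's actual import code may differ.

```python
from statistics import fmean

# Task columns as listed under Metrics (everything except the average itself).
TASKS = ("TOEFL", "QuALITY", "Coursera", "SFiction", "GSM", "CodeU")


def average_score(task_scores: dict[str, float]) -> float:
    """Return the unweighted mean of the per-task scores (in percent).

    Hypothetical helper: every task counts equally, regardless of how many
    examples it contains.
    """
    missing = [task for task in TASKS if task not in task_scores]
    if missing:
        # Refuse to average over a partial row so scores stay comparable.
        raise ValueError(f"missing task scores: {missing}")
    return fmean([task_scores[task] for task in TASKS])


# Placeholder values for illustration only; not real benchmark results.
example_row = {
    "TOEFL": 80.0,
    "QuALITY": 70.0,
    "Coursera": 60.0,
    "SFiction": 50.0,
    "GSM": 90.0,
    "CodeU": 10.0,
}

print(f"{average_score(example_row):.6f}%")
```

Because the mean is unweighted, a single weak task pulls the primary score down as much as a strong task pulls it up, whatever the size of each task's test set.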

Rank | Subject | Average score | Model Match | Provenance | Sampled
1 | GPT4-32k (2023) | 73.111667% | GPT-4 (openai-gpt-4) | Imported | 2026-05-27
2 | Llama3-70b-Instruct | 68.355% | Llama 3 70B Instruct (meta-llama-llama-3-70b-instruct) | Imported | 2026-05-27
3 | Llama3-8b-Instruct | 58.71% | Llama 3 8B Instruct (meta-llama-llama-3-8b-instruct) | Imported | 2026-05-27