FreshQA

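FreshQA is the question-answering benchmark from the FreshLLMs paper. It measures factual grounding: whether a model's answers reflect current world knowledge, including on questions built on false premises, with or without retrieval augmentation.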

Rows: 25
Primary metric: relaxed_accuracy
Sampled: 2026-05-27

Metadata

Metrics

Relaxed accuracy, Strict accuracy
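
The two metrics differ only in how a response is judged, so both can be sketched as fractions over per-response judgments. The sketch below assumes the reading used in the FreshLLMs paper: relaxed accuracy counts a response as correct when its main answer is correct, and strict accuracy additionally requires every claim in the response to be factual and up to date. The `Judgment` class and its field names are hypothetical and do not come from any FreshQA tooling.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Judgment:
    """One rated response. Hypothetical structure, for illustration only."""
    main_answer_correct: bool   # the primary answer is right
    all_claims_correct: bool    # no outdated or hallucinated side claims


def relaxed_accuracy(judgments: Sequence[Judgment]) -> float:
    """Fraction of responses whose main answer is correct."""
    if not judgments:
        raise ValueError("need at least one judgment")
    return sum(j.main_answer_correct for j in judgments) / len(judgments)


def strict_accuracy(judgments: Sequence[Judgment]) -> float:
    """Fraction of responses that are correct and contain no faulty claims."""
    if not judgments:
        raise ValueError("need at least one judgment")
    strict_hits = sum(
        j.main_answer_correct and j.all_claims_correct for j in judgments
    )
    return strict_hits / len(judgments)


# Made-up judgments, not FreshQA data.
sample = [
    Judgment(True, True),
    Judgment(True, False),   # right answer, but with an outdated side claim
    Judgment(False, False),
    Judgment(True, True),
]
print(f"relaxed: {relaxed_accuracy(sample):.1%}")
print(f"strict:  {strict_accuracy(sample):.1%}")
```

Under this reading, strict accuracy can never exceed relaxed accuracy for the same set of responses, because every strict hit is also a relaxed hit.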

Latest Results

Rows are transcribed from the public FreshLLMs paper. The primary score is the overall relaxed accuracy on FreshQA, taken from Table 5 of that paper.

| Rank | Subject | Relaxed accuracy | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT-4 | 46.4% | GPT-4 (openai-gpt-4) | Imported | 2026-05-27 |
| 2 | ChatGPT | 41.4% | | Imported | 2026-05-27 |
| 3 | GPT-3.5 | 32.4% | | Imported | 2026-05-27 |
| 4 | OpenAI Codex | 25.6% | | Imported | 2026-05-27 |
| 5 | Flan-PaLM 540B | 23.6% | | Imported | 2026-05-27 |
| 6 | PaLM 540B + chain-of-thought | 22.8% | | Imported | 2026-05-27 |
| 7 | PaLM 540B + few-shot | 20.2% | | Imported | 2026-05-27 |
| 8 | PaLMChilla 62B | 15.0% | | Imported | 2026-05-27 |
| 9 | PaLM 62B + few-shot | 14.2% | | Imported | 2026-05-27 |
| 10 | T5 XXL 11B + chain-of-thought | 13.0% | | Imported | 2026-05-27 |
| 11 | PaLM 62B + chain-of-thought | 12.8% | | Imported | 2026-05-27 |
| 12 | PaLM 540B | 12.2% | | Imported | 2026-05-27 |
| 13 | PaLM 8B + chain-of-thought | 11.4% | | Imported | 2026-05-27 |
| 14 | T5 XXL 11B | 10.8% | | Imported | 2026-05-27 |
| 15 | PaLM 8B + few-shot | 9.2% | | Imported | 2026-05-27 |
| 16 | T5 XXL 11B + few-shot | 9.0% | | Imported | 2026-05-27 |
| 17 | PaLM 8B | 8.8% | | Imported | 2026-05-27 |
| 18 | PaLM 62B | 8.6% | | Imported | 2026-05-27 |
| 19 | Flan-T5 XXL 11B | 7.2% | | Imported | 2026-05-27 |
| 20 | T5 XL 3B + few-shot | 6.0% | | Imported | 2026-05-27 |
| 21 | T5 XL 3B | 5.8% | | Imported | 2026-05-27 |
| 22 | T5 XL 3B + chain-of-thought | 5.2% | | Imported | 2026-05-27 |
| 23 | T5 Large 770M | 4.4% | | Imported | 2026-05-27 |
| 24 | T5 Large 770M + chain-of-thought | 2.2% | | Imported | 2026-05-27 |
| 25 | T5 Large 770M + few-shot | 0.8% | | Imported | 2026-05-27 |