FreshQA
FreshQA: Measures long-context retrieval, needle finding, summarization, factual grounding, or retrieval-augmented generation quality.
25 rows
Primary metric: relaxed_accuracy
Sampled: 2026-05-27
Metadata
Metrics: Relaxed accuracy, Strict accuracy
| Rank | Subject | Relaxed accuracy | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT-4 | 46.4% | GPT-4 openai-gpt-4 | Imported | 2026-05-27 |
| 2 | ChatGPT | 41.4% | — | Imported | 2026-05-27 |
| 3 | GPT-3.5 | 32.4% | — | Imported | 2026-05-27 |
| 4 | OpenAI Codex | 25.6% | — | Imported | 2026-05-27 |
| 5 | Flan-PaLM 540B | 23.6% | — | Imported | 2026-05-27 |
| 6 | PaLM 540B + chain-of-thought | 22.8% | — | Imported | 2026-05-27 |
| 7 | PaLM 540B + few-shot | 20.2% | — | Imported | 2026-05-27 |
| 8 | PaLMChilla 62B | 15.0% | — | Imported | 2026-05-27 |
| 9 | PaLM 62B + few-shot | 14.2% | — | Imported | 2026-05-27 |
| 10 | T5 XXL 11B + chain-of-thought | 13.0% | — | Imported | 2026-05-27 |
| 11 | PaLM 62B + chain-of-thought | 12.8% | — | Imported | 2026-05-27 |
| 12 | PaLM 540B | 12.2% | — | Imported | 2026-05-27 |
| 13 | PaLM 8B + chain-of-thought | 11.4% | — | Imported | 2026-05-27 |
| 14 | T5 XXL 11B | 10.8% | — | Imported | 2026-05-27 |
| 15 | PaLM 8B + few-shot | 9.2% | — | Imported | 2026-05-27 |
| 16 | T5 XXL 11B + few-shot | 9.0% | — | Imported | 2026-05-27 |
| 17 | PaLM 8B | 8.8% | — | Imported | 2026-05-27 |
| 18 | PaLM 62B | 8.6% | — | Imported | 2026-05-27 |
| 19 | Flan-T5 XXL 11B | 7.2% | — | Imported | 2026-05-27 |
| 20 | T5 XL 3B + few-shot | 6.0% | — | Imported | 2026-05-27 |
| 21 | T5 XL 3B | 5.8% | — | Imported | 2026-05-27 |
| 22 | T5 XL 3B + chain-of-thought | 5.2% | — | Imported | 2026-05-27 |
| 23 | T5 Large 770M | 4.4% | — | Imported | 2026-05-27 |
| 24 | T5 Large 770M + chain-of-thought | 2.2% | — | Imported | 2026-05-27 |
| 25 | T5 Large 770M + few-shot | 0.8% | — | Imported | 2026-05-27 |
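
The table above can be read programmatically to check that the Rank column agrees with the relaxed accuracy scores. The following is a minimal sketch, not part of the leaderboard itself: the helper names `parse_leaderboard` and `ranks_are_consistent` are hypothetical, and the sample only copies the first three rows of the table.

```python
# Hypothetical helpers for illustration; the names are not from the leaderboard.

def parse_leaderboard(markdown: str) -> list[tuple[int, str, float]]:
    """Parse pipe-delimited rows into (rank, subject, relaxed_accuracy_percent)."""
    rows = []
    for line in markdown.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip the header and separator rows, whose first cell is not a number.
        if len(cells) < 3 or not cells[0].isdigit():
            continue
        rows.append((int(cells[0]), cells[1], float(cells[2].rstrip("%"))))
    return rows


def ranks_are_consistent(rows: list[tuple[int, str, float]]) -> bool:
    """Return True if scores never increase as rank goes from 1 downward."""
    ordered = sorted(rows, key=lambda r: r[0])
    scores = [r[2] for r in ordered]
    return all(a >= b for a, b in zip(scores, scores[1:]))


# First three rows copied verbatim from the table above.
SAMPLE = """
| Rank | Subject | Relaxed accuracy | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT-4 | 46.4% | GPT-4 openai-gpt-4 | Imported | 2026-05-27 |
| 2 | ChatGPT | 41.4% | — | Imported | 2026-05-27 |
| 3 | GPT-3.5 | 32.4% | — | Imported | 2026-05-27 |
"""

if __name__ == "__main__":
    rows = parse_leaderboard(SAMPLE)
    for rank, subject, score in rows:
        print(rank, subject, score)
    print("ranks consistent:", ranks_are_consistent(rows))
```

Passing the full 25-row table as the input string would run the same check over every entry.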