FreshQA | BenchmarkList

Metadata

Metrics: Relaxed accuracy, Strict accuracy

Rank	Subject	Relaxed accuracy	Model Match	Provenance	Sampled
1	GPT-4	46.4%	GPT-4 (openai-gpt-4)	Imported	2026-05-27
2	ChatGPT	41.4%	—	Imported	2026-05-27
3	GPT-3.5	32.4%	—	Imported	2026-05-27
4	OpenAI Codex	25.6%	—	Imported	2026-05-27
5	Flan-PaLM 540B	23.6%	—	Imported	2026-05-27
6	PaLM 540B + chain-of-thought	22.8%	—	Imported	2026-05-27
7	PaLM 540B + few-shot	20.2%	—	Imported	2026-05-27
8	PaLMChilla 62B	15.0%	—	Imported	2026-05-27
9	PaLM 62B + few-shot	14.2%	—	Imported	2026-05-27
10	T5 XXL 11B + chain-of-thought	13.0%	—	Imported	2026-05-27
11	PaLM 62B + chain-of-thought	12.8%	—	Imported	2026-05-27
12	PaLM 540B	12.2%	—	Imported	2026-05-27
13	PaLM 8B + chain-of-thought	11.4%	—	Imported	2026-05-27
14	T5 XXL 11B	10.8%	—	Imported	2026-05-27
15	PaLM 8B + few-shot	9.2%	—	Imported	2026-05-27
16	T5 XXL 11B + few-shot	9.0%	—	Imported	2026-05-27
17	PaLM 8B	8.8%	—	Imported	2026-05-27
18	PaLM 62B	8.6%	—	Imported	2026-05-27
19	Flan-T5 XXL 11B	7.2%	—	Imported	2026-05-27
20	T5 XL 3B + few-shot	6.0%	—	Imported	2026-05-27
21	T5 XL 3B	5.8%	—	Imported	2026-05-27
22	T5 XL 3B + chain-of-thought	5.2%	—	Imported	2026-05-27
23	T5 Large 770M	4.4%	—	Imported	2026-05-27
24	T5 Large 770M + chain-of-thought	2.2%	—	Imported	2026-05-27
25	T5 Large 770M + few-shot	0.8%	—	Imported	2026-05-27