Natural Questions | BenchmarkList

Metadata

ID: natural_questions
Category: Intelligence
Release: Unknown

Metrics

NaturalQuestions open-book F1, NaturalQuestions closed-book F1
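
The page does not say how these F1 scores are computed. As a rough illustration only, the sketch below assumes the token-overlap F1 commonly used for question answering (SQuAD-style normalization, best score over several reference answers). The function names `normalize`, `token_f1`, and `best_f1` are hypothetical and are not taken from the benchmark's own code.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall (assumed metric, see lead-in)."""
    pred = normalize(prediction)
    ref = normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both the prediction and the reference.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def best_f1(prediction: str, references: list[str]) -> float:
    """Score a prediction against several acceptable answers, keeping the best."""
    return max(token_f1(prediction, ref) for ref in references)
```

Under this assumption, a leaderboard percentage would be the mean of `best_f1` over all evaluated questions, multiplied by 100.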

Rank	Subject	NaturalQuestions open-book F1	Model Match	Provenance	Sampled
1	text-davinci-003	77.022901%	—	Imported	2026-05-27
2	Cohere Command beta (52.4B)	75.992227%	—	Imported	2026-05-27
3	Cohere Command beta (6.1B)	71.749192%	—	Imported	2026-05-27
4	text-davinci-002	71.315853%	—	Imported	2026-05-27
5	MPT-Instruct (30B)	69.71635%	—	Imported	2026-05-27
6	Mistral v0.1 (7B)	68.656051%	—	Imported	2026-05-27
7	Vicuna v1.3 (13B)	68.649803%	—	Imported	2026-05-27
8	Anthropic-LM v4-s3 (52B)	68.639181%	—	Imported	2026-05-27
9	InstructPalmyra (30B)	68.210608%	—	Imported	2026-05-27
10	Falcon (40B)	67.525246%	—	Imported	2026-05-27
11	gpt-3.5-turbo-0613	67.477806%	GPT-3.5 Turbo (openai-gpt-3.5-turbo)	Imported	2026-05-27
12	Llama 2 (70B)	67.417439%	—	Imported	2026-05-27
13	MPT (30B)	67.29406%	—	Imported	2026-05-27
14	LLaMA (65B)	67.209534%	—	Imported	2026-05-27
15	Jurassic-2 Jumbo (178B)	66.900014%	—	Imported	2026-05-27
16	Falcon-Instruct (40B)	66.593484%	—	Imported	2026-05-27
17	LLaMA (30B)	66.559555%	—	Imported	2026-05-27
18	RedPajama-INCITE-Instruct (7B)	65.918407%	—	Imported	2026-05-27
19	Luminous Supreme (70B)	64.856687%	—	Imported	2026-05-27
20	GLM (130B)	64.243034%	—	Imported	2026-05-27
21	TNLG v2 (530B)	64.214262%	—	Imported	2026-05-27
22	Jurassic-2 Grande (17B)	63.945619%	—	Imported	2026-05-27
23	Llama 2 (13B)	63.730249%	—	Imported	2026-05-27
24	RedPajama-INCITE-Instruct-v1 (3B)	63.713554%	—	Imported	2026-05-27
25	Vicuna v1.3 (7B)	63.392368%	—	Imported	2026-05-27
26	Cohere xlarge v20221108 (52.4B)	62.848773%	—	Imported	2026-05-27
27	davinci (175B)	62.45969%	—	Imported	2026-05-27
28	J1-Grande v2 beta (17B)	62.454169%	—	Imported	2026-05-27
29	gpt-3.5-turbo-0301	62.433188%	GPT-3.5 Turbo (openai-gpt-3.5-turbo)	Imported	2026-05-27
30	BLOOM (176B)	62.13206%	—	Imported	2026-05-27
31	OPT (175B)	61.487672%	—	Imported	2026-05-27
32	LLaMA (13B)	61.434207%	—	Imported	2026-05-27
33	Llama 2 (7B)	61.128489%	—	Imported	2026-05-27
34	Luminous Extended (30B)	60.885459%	—	Imported	2026-05-27
35	GPT-NeoX (20B)	59.608133%	—	Imported	2026-05-27
36	OPT (66B)	59.597863%	—	Imported	2026-05-27
37	J1-Jumbo v1 (178B)	59.521262%	—	Imported	2026-05-27
38	Cohere xlarge v20220609 (52.4B)	59.514992%	—	Imported	2026-05-27
39	Alpaca (7B)	59.246225%	—	Imported	2026-05-27
40	Jurassic-2 Large (7.5B)	58.874707%	—	Imported	2026-05-27
41	LLaMA (7B)	58.863482%	—	Imported	2026-05-27
42	RedPajama-INCITE-Base (7B)	58.629834%	—	Imported	2026-05-27
43	Pythia (12B)	58.082286%	—	Imported	2026-05-27
44	Falcon (7B)	57.947678%	—	Imported	2026-05-27
45	J1-Grande v1 (17B)	57.783382%	—	Imported	2026-05-27
46	Cohere large v20220720 (13.1B)	57.325875%	—	Imported	2026-05-27
47	text-curie-001	57.135933%	—	Imported	2026-05-27
48	Luminous Base (13B)	56.829019%	—	Imported	2026-05-27
49	TNLG v2 (6.7B)	56.103213%	—	Imported	2026-05-27
50	GPT-J (6B)	55.890657%	—	Imported	2026-05-27
51	curie (6.7B)	55.150397%	—	Imported	2026-05-27
52	Pythia (6.9B)	53.891341%	—	Imported	2026-05-27
53	J1-Large v1 (7.5B)	53.217386%	—	Imported	2026-05-27
54	RedPajama-INCITE-Base-v1 (3B)	51.995114%	—	Imported	2026-05-27
55	Cohere medium v20221108 (6.1B)	51.698317%	—	Imported	2026-05-27
56	Cohere medium v20220720 (6.1B)	50.407819%	—	Imported	2026-05-27
57	T5 (11B)	47.732108%	—	Imported	2026-05-27
58	babbage (1.3B)	45.128402%	—	Imported	2026-05-27
59	Falcon-Instruct (7B)	44.887418%	—	Imported	2026-05-27
60	ada (350M)	36.511147%	—	Imported	2026-05-27
61	UL2 (20B)	34.921603%	—	Imported	2026-05-27
62	text-babbage-001	32.956771%	—	Imported	2026-05-27
63	Cohere small v20220720 (410M)	30.94735%	—	Imported	2026-05-27
64	YaLM (100B)	22.655585%	—	Imported	2026-05-27
65	T0pp (11B)	18.97137%	—	Imported	2026-05-27
66	text-ada-001	14.883038%	—	Imported	2026-05-27
