Natural Questions

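Natural Questions: An open-domain question-answering benchmark built from real Google search queries, with answers drawn from Wikipedia. Models are scored by F1 in two settings: open-book (a supporting passage is supplied with the question) and closed-book (the model answers without a passage).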

Rows: 66
Primary metric: naturalquestions_open_book_f1
Sampled: 2026-05-27

Metadata

Metrics

NaturalQuestions open-book F1, NaturalQuestions closed-book F1
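
Both metrics are F1 scores over answer text. As a rough illustration of how such a score is commonly computed for question answering, the sketch below uses SQuAD-style token-overlap F1. It is an assumption that this matches the exact normalization behind the numbers on this page; the function names `normalize` and `token_f1` are hypothetical and not taken from HELM.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace.

    Assumption: SQuAD-style normalization; the benchmark's own
    implementation may differ in details.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall against one reference."""
    pred = normalize(prediction)
    ref = normalize(reference)
    if not pred or not ref:
        # Two empty answers count as a match; one empty answer does not.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A prediction that contains the reference plus extra tokens is penalized on precision only: `token_f1("Paris, France", "Paris")` has precision 1/2 and recall 1, giving an F1 of 2/3. A leaderboard value is then typically the average of per-question scores, expressed as a percentage.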

Latest Results

Rows are ranked by NaturalQuestions open-book F1 from the HELM Classic core scenarios aggregate table.

Rank Subject NaturalQuestions open-book F1 Model Match Provenance Sampled
1 text-davinci-003 77.022901% Imported 2026-05-27
2 Cohere Command beta (52.4B) 75.992227% Imported 2026-05-27
3 Cohere Command beta (6.1B) 71.749192% Imported 2026-05-27
4 text-davinci-002 71.315853% Imported 2026-05-27
5 MPT-Instruct (30B) 69.71635% Imported 2026-05-27
6 Mistral v0.1 (7B) 68.656051% Imported 2026-05-27
7 Vicuna v1.3 (13B) 68.649803% Imported 2026-05-27
8 Anthropic-LM v4-s3 (52B) 68.639181% Imported 2026-05-27
9 InstructPalmyra (30B) 68.210608% Imported 2026-05-27
10 Falcon (40B) 67.525246% Imported 2026-05-27
11 gpt-3.5-turbo-0613 67.477806% GPT-3.5 Turbo (openai-gpt-3.5-turbo) Imported 2026-05-27
12 Llama 2 (70B) 67.417439% Imported 2026-05-27
13 MPT (30B) 67.29406% Imported 2026-05-27
14 LLaMA (65B) 67.209534% Imported 2026-05-27
15 Jurassic-2 Jumbo (178B) 66.900014% Imported 2026-05-27
16 Falcon-Instruct (40B) 66.593484% Imported 2026-05-27
17 LLaMA (30B) 66.559555% Imported 2026-05-27
18 RedPajama-INCITE-Instruct (7B) 65.918407% Imported 2026-05-27
19 Luminous Supreme (70B) 64.856687% Imported 2026-05-27
20 GLM (130B) 64.243034% Imported 2026-05-27
21 TNLG v2 (530B) 64.214262% Imported 2026-05-27
22 Jurassic-2 Grande (17B) 63.945619% Imported 2026-05-27
23 Llama 2 (13B) 63.730249% Imported 2026-05-27
24 RedPajama-INCITE-Instruct-v1 (3B) 63.713554% Imported 2026-05-27
25 Vicuna v1.3 (7B) 63.392368% Imported 2026-05-27
26 Cohere xlarge v20221108 (52.4B) 62.848773% Imported 2026-05-27
27 davinci (175B) 62.45969% Imported 2026-05-27
28 J1-Grande v2 beta (17B) 62.454169% Imported 2026-05-27
29 gpt-3.5-turbo-0301 62.433188% GPT-3.5 Turbo (openai-gpt-3.5-turbo) Imported 2026-05-27
30 BLOOM (176B) 62.13206% Imported 2026-05-27
31 OPT (175B) 61.487672% Imported 2026-05-27
32 LLaMA (13B) 61.434207% Imported 2026-05-27
33 Llama 2 (7B) 61.128489% Imported 2026-05-27
34 Luminous Extended (30B) 60.885459% Imported 2026-05-27
35 GPT-NeoX (20B) 59.608133% Imported 2026-05-27
36 OPT (66B) 59.597863% Imported 2026-05-27
37 J1-Jumbo v1 (178B) 59.521262% Imported 2026-05-27
38 Cohere xlarge v20220609 (52.4B) 59.514992% Imported 2026-05-27
39 Alpaca (7B) 59.246225% Imported 2026-05-27
40 Jurassic-2 Large (7.5B) 58.874707% Imported 2026-05-27
41 LLaMA (7B) 58.863482% Imported 2026-05-27
42 RedPajama-INCITE-Base (7B) 58.629834% Imported 2026-05-27
43 Pythia (12B) 58.082286% Imported 2026-05-27
44 Falcon (7B) 57.947678% Imported 2026-05-27
45 J1-Grande v1 (17B) 57.783382% Imported 2026-05-27
46 Cohere large v20220720 (13.1B) 57.325875% Imported 2026-05-27
47 text-curie-001 57.135933% Imported 2026-05-27
48 Luminous Base (13B) 56.829019% Imported 2026-05-27
49 TNLG v2 (6.7B) 56.103213% Imported 2026-05-27
50 GPT-J (6B) 55.890657% Imported 2026-05-27
51 curie (6.7B) 55.150397% Imported 2026-05-27
52 Pythia (6.9B) 53.891341% Imported 2026-05-27
53 J1-Large v1 (7.5B) 53.217386% Imported 2026-05-27
54 RedPajama-INCITE-Base-v1 (3B) 51.995114% Imported 2026-05-27
55 Cohere medium v20221108 (6.1B) 51.698317% Imported 2026-05-27
56 Cohere medium v20220720 (6.1B) 50.407819% Imported 2026-05-27
57 T5 (11B) 47.732108% Imported 2026-05-27
58 babbage (1.3B) 45.128402% Imported 2026-05-27
59 Falcon-Instruct (7B) 44.887418% Imported 2026-05-27
60 ada (350M) 36.511147% Imported 2026-05-27
61 UL2 (20B) 34.921603% Imported 2026-05-27
62 text-babbage-001 32.956771% Imported 2026-05-27
63 Cohere small v20220720 (410M) 30.94735% Imported 2026-05-27
64 YaLM (100B) 22.655585% Imported 2026-05-27
65 T0pp (11B) 18.97137% Imported 2026-05-27
66 text-ada-001 14.883038% Imported 2026-05-27
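
For readers who want to work with the table programmatically, the following is a minimal sketch of parsing one row per line and checking that ranks follow descending open-book F1. It assumes the plain-text layout shown above (rank, subject, percentage, optional Model Match text, the word "Imported", and a date); `Row`, `parse_row`, and `is_ranked_by_score` are hypothetical names introduced only for this example.

```python
import re
from typing import NamedTuple, Optional


class Row(NamedTuple):
    rank: int
    subject: str
    f1_percent: float
    model_match: Optional[str]
    sampled: str


# rank, subject, score%, optional model-match text, "Imported", ISO date
ROW_RE = re.compile(
    r"^(\d+) (.+?) (\d+(?:\.\d+)?)% (?:(.+) )?Imported (\d{4}-\d{2}-\d{2})$"
)


def parse_row(line: str) -> Row:
    """Parse a single leaderboard line into a Row, or raise ValueError."""
    m = ROW_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized row: {line!r}")
    rank, subject, score, match, sampled = m.groups()
    return Row(int(rank), subject, float(score), match, sampled)


def is_ranked_by_score(rows: list[Row]) -> bool:
    """True if scores never increase as rank increases."""
    scores = [r.f1_percent for r in sorted(rows, key=lambda r: r.rank)]
    return all(a >= b for a, b in zip(scores, scores[1:]))


# Three lines copied from the table above.
sample = [
    "1 text-davinci-003 77.022901% Imported 2026-05-27",
    "5 MPT-Instruct (30B) 69.71635% Imported 2026-05-27",
    "66 text-ada-001 14.883038% Imported 2026-05-27",
]
rows = [parse_row(line) for line in sample]
```

The subject pattern is non-greedy so that model names containing spaces and parenthesized sizes, such as `MPT-Instruct (30B)`, stay in one field while the percentage is still captured separately.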