Natural Questions
Natural Questions: an open-domain question-answering benchmark, listed under evaluations of broad language-model knowledge, reasoning, commonsense, instruction following, or exam-style accuracy.
Rows: 66
Primary metric: naturalquestions_open_book_f1
Sampled: 2026-05-27
Metadata
Metrics: NaturalQuestions open-book F1, NaturalQuestions closed-book F1
| Rank | Subject | NaturalQuestions open-book F1 | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | text-davinci-003 | 77.022901% | — | Imported | 2026-05-27 |
| 2 | Cohere Command beta (52.4B) | 75.992227% | — | Imported | 2026-05-27 |
| 3 | Cohere Command beta (6.1B) | 71.749192% | — | Imported | 2026-05-27 |
| 4 | text-davinci-002 | 71.315853% | — | Imported | 2026-05-27 |
| 5 | MPT-Instruct (30B) | 69.71635% | — | Imported | 2026-05-27 |
| 6 | Mistral v0.1 (7B) | 68.656051% | — | Imported | 2026-05-27 |
| 7 | Vicuna v1.3 (13B) | 68.649803% | — | Imported | 2026-05-27 |
| 8 | Anthropic-LM v4-s3 (52B) | 68.639181% | — | Imported | 2026-05-27 |
| 9 | InstructPalmyra (30B) | 68.210608% | — | Imported | 2026-05-27 |
| 10 | Falcon (40B) | 67.525246% | — | Imported | 2026-05-27 |
| 11 | gpt-3.5-turbo-0613 | 67.477806% | GPT-3.5 Turbo openai-gpt-3.5-turbo | Imported | 2026-05-27 |
| 12 | Llama 2 (70B) | 67.417439% | — | Imported | 2026-05-27 |
| 13 | MPT (30B) | 67.29406% | — | Imported | 2026-05-27 |
| 14 | LLaMA (65B) | 67.209534% | — | Imported | 2026-05-27 |
| 15 | Jurassic-2 Jumbo (178B) | 66.900014% | — | Imported | 2026-05-27 |
| 16 | Falcon-Instruct (40B) | 66.593484% | — | Imported | 2026-05-27 |
| 17 | LLaMA (30B) | 66.559555% | — | Imported | 2026-05-27 |
| 18 | RedPajama-INCITE-Instruct (7B) | 65.918407% | — | Imported | 2026-05-27 |
| 19 | Luminous Supreme (70B) | 64.856687% | — | Imported | 2026-05-27 |
| 20 | GLM (130B) | 64.243034% | — | Imported | 2026-05-27 |
| 21 | TNLG v2 (530B) | 64.214262% | — | Imported | 2026-05-27 |
| 22 | Jurassic-2 Grande (17B) | 63.945619% | — | Imported | 2026-05-27 |
| 23 | Llama 2 (13B) | 63.730249% | — | Imported | 2026-05-27 |
| 24 | RedPajama-INCITE-Instruct-v1 (3B) | 63.713554% | — | Imported | 2026-05-27 |
| 25 | Vicuna v1.3 (7B) | 63.392368% | — | Imported | 2026-05-27 |
| 26 | Cohere xlarge v20221108 (52.4B) | 62.848773% | — | Imported | 2026-05-27 |
| 27 | davinci (175B) | 62.45969% | — | Imported | 2026-05-27 |
| 28 | J1-Grande v2 beta (17B) | 62.454169% | — | Imported | 2026-05-27 |
| 29 | gpt-3.5-turbo-0301 | 62.433188% | GPT-3.5 Turbo openai-gpt-3.5-turbo | Imported | 2026-05-27 |
| 30 | BLOOM (176B) | 62.13206% | — | Imported | 2026-05-27 |
| 31 | OPT (175B) | 61.487672% | — | Imported | 2026-05-27 |
| 32 | LLaMA (13B) | 61.434207% | — | Imported | 2026-05-27 |
| 33 | Llama 2 (7B) | 61.128489% | — | Imported | 2026-05-27 |
| 34 | Luminous Extended (30B) | 60.885459% | — | Imported | 2026-05-27 |
| 35 | GPT-NeoX (20B) | 59.608133% | — | Imported | 2026-05-27 |
| 36 | OPT (66B) | 59.597863% | — | Imported | 2026-05-27 |
| 37 | J1-Jumbo v1 (178B) | 59.521262% | — | Imported | 2026-05-27 |
| 38 | Cohere xlarge v20220609 (52.4B) | 59.514992% | — | Imported | 2026-05-27 |
| 39 | Alpaca (7B) | 59.246225% | — | Imported | 2026-05-27 |
| 40 | Jurassic-2 Large (7.5B) | 58.874707% | — | Imported | 2026-05-27 |
| 41 | LLaMA (7B) | 58.863482% | — | Imported | 2026-05-27 |
| 42 | RedPajama-INCITE-Base (7B) | 58.629834% | — | Imported | 2026-05-27 |
| 43 | Pythia (12B) | 58.082286% | — | Imported | 2026-05-27 |
| 44 | Falcon (7B) | 57.947678% | — | Imported | 2026-05-27 |
| 45 | J1-Grande v1 (17B) | 57.783382% | — | Imported | 2026-05-27 |
| 46 | Cohere large v20220720 (13.1B) | 57.325875% | — | Imported | 2026-05-27 |
| 47 | text-curie-001 | 57.135933% | — | Imported | 2026-05-27 |
| 48 | Luminous Base (13B) | 56.829019% | — | Imported | 2026-05-27 |
| 49 | TNLG v2 (6.7B) | 56.103213% | — | Imported | 2026-05-27 |
| 50 | GPT-J (6B) | 55.890657% | — | Imported | 2026-05-27 |
| 51 | curie (6.7B) | 55.150397% | — | Imported | 2026-05-27 |
| 52 | Pythia (6.9B) | 53.891341% | — | Imported | 2026-05-27 |
| 53 | J1-Large v1 (7.5B) | 53.217386% | — | Imported | 2026-05-27 |
| 54 | RedPajama-INCITE-Base-v1 (3B) | 51.995114% | — | Imported | 2026-05-27 |
| 55 | Cohere medium v20221108 (6.1B) | 51.698317% | — | Imported | 2026-05-27 |
| 56 | Cohere medium v20220720 (6.1B) | 50.407819% | — | Imported | 2026-05-27 |
| 57 | T5 (11B) | 47.732108% | — | Imported | 2026-05-27 |
| 58 | babbage (1.3B) | 45.128402% | — | Imported | 2026-05-27 |
| 59 | Falcon-Instruct (7B) | 44.887418% | — | Imported | 2026-05-27 |
| 60 | ada (350M) | 36.511147% | — | Imported | 2026-05-27 |
| 61 | UL2 (20B) | 34.921603% | — | Imported | 2026-05-27 |
| 62 | text-babbage-001 | 32.956771% | — | Imported | 2026-05-27 |
| 63 | Cohere small v20220720 (410M) | 30.94735% | — | Imported | 2026-05-27 |
| 64 | YaLM (100B) | 22.655585% | — | Imported | 2026-05-27 |
| 65 | T0pp (11B) | 18.97137% | — | Imported | 2026-05-27 |
| 66 | text-ada-001 | 14.883038% | — | Imported | 2026-05-27 |