NarrativeQA
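NarrativeQA: A reading-comprehension benchmark in which models answer free-form questions about long narratives (books and movie scripts). It is listed here in the category covering long-context retrieval, needle finding, summarization, factual grounding, and retrieval-augmented generation quality.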
Rows: 66
Primary metric: F1
Sampled: 2026-05-27
Metrics
F1, ECE (10-bin) (lower is better), F1 (Robustness), F1 (Fairness), Denoised inference time (s) (lower is better), # eval
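The primary metric listed above is F1. For free-form answers this is commonly computed as token-overlap F1 between a model's answer and a reference answer. The page does not state the exact normalization or how multiple references are combined, so the sketch below assumes SQuAD-style normalization (lowercasing, stripping punctuation and English articles) and takes the best score over the references. The names `normalize`, `token_f1`, and `best_f1` are illustrative and not part of any leaderboard tooling.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace.

    Assumption: SQuAD-style normalization; the leaderboard's own rules may differ.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall, in the range 0.0 to 1.0."""
    pred = normalize(prediction)
    ref = normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def best_f1(prediction: str, references: list[str]) -> float:
    """Score against several references and keep the best (an assumption)."""
    return max(token_f1(prediction, ref) for ref in references)
```

The table reports F1 as a percentage, so a score of 0.8 from this sketch would correspond to 80%.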
| Rank | Subject | F1 | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Llama 2 (70B) | 76.993143% | — | Imported | 2026-05-27 |
| 2 | LLaMA (65B) | 75.486594% | — | Imported | 2026-05-27 |
| 3 | LLaMA (30B) | 75.247492% | — | Imported | 2026-05-27 |
| 4 | Cohere Command beta (52.4B) | 75.167436% | — | Imported | 2026-05-27 |
| 5 | Llama 2 (13B) | 74.399923% | — | Imported | 2026-05-27 |
| 6 | Palmyra X (43B) | 74.187884% | — | Imported | 2026-05-27 |
| 7 | Jurassic-2 Grande (17B) | 73.67678% | — | Imported | 2026-05-27 |
| 8 | Jurassic-2 Jumbo (178B) | 73.317916% | — | Imported | 2026-05-27 |
| 9 | MPT-Instruct (30B) | 73.286082% | — | Imported | 2026-05-27 |
| 10 | MPT (30B) | 73.152584% | — | Imported | 2026-05-27 |
| 11 | Anthropic-LM v4-s3 (52B) | 72.842412% | — | Imported | 2026-05-27 |
| 12 | text-davinci-002 | 72.717387% | — | Imported | 2026-05-27 |
| 13 | text-davinci-003 | 72.706329% | — | Imported | 2026-05-27 |
| 14 | J1-Grande v2 beta (17B) | 72.532306% | — | Imported | 2026-05-27 |
| 15 | TNLG v2 (530B) | 72.194521% | — | Imported | 2026-05-27 |
| 16 | Mistral v0.1 (7B) | 71.645859% | — | Imported | 2026-05-27 |
| 17 | LLaMA (13B) | 71.116263% | — | Imported | 2026-05-27 |
| 18 | Luminous Supreme (70B) | 71.103058% | — | Imported | 2026-05-27 |
| 19 | Cohere Command beta (6.1B) | 70.921179% | — | Imported | 2026-05-27 |
| 20 | GLM (130B) | 70.59236% | — | Imported | 2026-05-27 |
| 21 | J1-Jumbo v1 (178B) | 69.504427% | — | Imported | 2026-05-27 |
| 22 | Llama 2 (7B) | 69.123281% | — | Imported | 2026-05-27 |
| 23 | Vicuna v1.3 (13B) | 69.069142% | — | Imported | 2026-05-27 |
| 24 | davinci (175B) | 68.687981% | — | Imported | 2026-05-27 |
| 25 | Falcon (40B) | 67.262999% | — | Imported | 2026-05-27 |
| 26 | Cohere xlarge v20221108 (52.4B) | 67.235984% | — | Imported | 2026-05-27 |
| 27 | J1-Grande v1 (17B) | 67.183769% | — | Imported | 2026-05-27 |
| 28 | OPT (175B) | 67.098924% | — | Imported | 2026-05-27 |
| 29 | LLaMA (7B) | 66.909449% | — | Imported | 2026-05-27 |
| 30 | Luminous Extended (30B) | 66.481763% | — | Imported | 2026-05-27 |
| 31 | gpt-3.5-turbo-0301 | 66.304398% | GPT-3.5 Turbo (`openai-gpt-3.5-turbo`) | Imported | 2026-05-27 |
| 32 | BLOOM (176B) | 66.205187% | — | Imported | 2026-05-27 |
| 33 | Cohere xlarge v20220609 (52.4B) | 65.013315% | — | Imported | 2026-05-27 |
| 34 | Vicuna v1.3 (7B) | 64.318002% | — | Imported | 2026-05-27 |
| 35 | RedPajama-INCITE-Instruct (7B) | 63.768167% | — | Imported | 2026-05-27 |
| 36 | RedPajama-INCITE-Instruct-v1 (3B) | 63.757678% | — | Imported | 2026-05-27 |
| 37 | OPT (66B) | 63.755182% | — | Imported | 2026-05-27 |
| 38 | TNLG v2 (6.7B) | 63.091322% | — | Imported | 2026-05-27 |
| 39 | gpt-3.5-turbo-0613 | 62.507836% | GPT-3.5 Turbo (`openai-gpt-3.5-turbo`) | Imported | 2026-05-27 |
| 40 | Cohere large v20220720 (13.1B) | 62.46782% | — | Imported | 2026-05-27 |
| 41 | Falcon-Instruct (40B) | 62.466006% | — | Imported | 2026-05-27 |
| 42 | J1-Large v1 (7.5B) | 62.33777% | — | Imported | 2026-05-27 |
| 43 | Falcon (7B) | 62.09479% | — | Imported | 2026-05-27 |
| 44 | RedPajama-INCITE-Base (7B) | 61.733571% | — | Imported | 2026-05-27 |
| 45 | Cohere medium v20221108 (6.1B) | 61.042163% | — | Imported | 2026-05-27 |
| 46 | Luminous Base (13B) | 60.490139% | — | Imported | 2026-05-27 |
| 47 | curie (6.7B) | 60.440873% | — | Imported | 2026-05-27 |
| 48 | GPT-NeoX (20B) | 59.891555% | — | Imported | 2026-05-27 |
| 49 | Pythia (12B) | 59.626738% | — | Imported | 2026-05-27 |
| 50 | text-curie-001 | 58.19233% | — | Imported | 2026-05-27 |
| 51 | Cohere medium v20220720 (6.1B) | 55.903339% | — | Imported | 2026-05-27 |
| 52 | RedPajama-INCITE-Base-v1 (3B) | 55.511204% | — | Imported | 2026-05-27 |
| 53 | GPT-J (6B) | 54.469197% | — | Imported | 2026-05-27 |
| 54 | Pythia (6.9B) | 52.838757% | — | Imported | 2026-05-27 |
| 55 | InstructPalmyra (30B) | 49.645549% | — | Imported | 2026-05-27 |
| 56 | babbage (1.3B) | 49.134125% | — | Imported | 2026-05-27 |
| 57 | Falcon-Instruct (7B) | 47.631318% | — | Imported | 2026-05-27 |
| 58 | text-babbage-001 | 42.935871% | — | Imported | 2026-05-27 |
| 59 | Alpaca (7B) | 39.598078% | — | Imported | 2026-05-27 |
| 60 | ada (350M) | 32.610438% | — | Imported | 2026-05-27 |
| 61 | Cohere small v20220720 (410M) | 29.366608% | — | Imported | 2026-05-27 |
| 62 | YaLM (100B) | 25.212195% | — | Imported | 2026-05-27 |
| 63 | text-ada-001 | 23.810811% | — | Imported | 2026-05-27 |
| 64 | T0pp (11B) | 15.117621% | — | Imported | 2026-05-27 |
| 65 | T5 (11B) | 8.557844% | — | Imported | 2026-05-27 |
| 66 | UL2 (20B) | 8.275131% | — | Imported | 2026-05-27 |