NarrativeQA

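NarrativeQA: A reading-comprehension benchmark of free-form questions about stories (books and movie scripts). It is grouped with tasks that measure long-context retrieval, needle finding, summarization, factual grounding, or retrieval-augmented generation quality.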

Rows: 66
Primary metric: F1
Sampled: 2026-05-27

Metadata

Metrics

F1, ECE (10-bin) (lower is better), F1 (Robustness), F1 (Fairness), Denoised inference time (s) (lower is better), # eval
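For readers unfamiliar with the primary metric, F1 on free-form QA is usually a token-overlap score between the predicted answer and a reference answer. The sketch below assumes SQuAD-style normalization (lowercasing, stripping punctuation and English articles); the exact normalization behind the numbers on this page may differ, and the function names `normalize`, `token_f1`, and `best_f1` are illustrative rather than taken from any library.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall (assumed SQuAD-style)."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def best_f1(prediction: str, references: list[str]) -> float:
    """Score against several reference answers and keep the best match."""
    return max(token_f1(prediction, ref) for ref in references)
```

A benchmark-level F1 would then be the mean of `best_f1` over all evaluated questions, shown in the table as a percentage.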

Latest Results

Rows are ranked by F1 from the aggregate HELM Classic narrative_qa group table.

Rank Subject F1 Model Match Provenance Sampled
1 Llama 2 (70B) 76.993143% Imported 2026-05-27
2 LLaMA (65B) 75.486594% Imported 2026-05-27
3 LLaMA (30B) 75.247492% Imported 2026-05-27
4 Cohere Command beta (52.4B) 75.167436% Imported 2026-05-27
5 Llama 2 (13B) 74.399923% Imported 2026-05-27
6 Palmyra X (43B) 74.187884% Imported 2026-05-27
7 Jurassic-2 Grande (17B) 73.67678% Imported 2026-05-27
8 Jurassic-2 Jumbo (178B) 73.317916% Imported 2026-05-27
9 MPT-Instruct (30B) 73.286082% Imported 2026-05-27
10 MPT (30B) 73.152584% Imported 2026-05-27
11 Anthropic-LM v4-s3 (52B) 72.842412% Imported 2026-05-27
12 text-davinci-002 72.717387% Imported 2026-05-27
13 text-davinci-003 72.706329% Imported 2026-05-27
14 J1-Grande v2 beta (17B) 72.532306% Imported 2026-05-27
15 TNLG v2 (530B) 72.194521% Imported 2026-05-27
16 Mistral v0.1 (7B) 71.645859% Imported 2026-05-27
17 LLaMA (13B) 71.116263% Imported 2026-05-27
18 Luminous Supreme (70B) 71.103058% Imported 2026-05-27
19 Cohere Command beta (6.1B) 70.921179% Imported 2026-05-27
20 GLM (130B) 70.59236% Imported 2026-05-27
21 J1-Jumbo v1 (178B) 69.504427% Imported 2026-05-27
22 Llama 2 (7B) 69.123281% Imported 2026-05-27
23 Vicuna v1.3 (13B) 69.069142% Imported 2026-05-27
24 davinci (175B) 68.687981% Imported 2026-05-27
25 Falcon (40B) 67.262999% Imported 2026-05-27
26 Cohere xlarge v20221108 (52.4B) 67.235984% Imported 2026-05-27
27 J1-Grande v1 (17B) 67.183769% Imported 2026-05-27
28 OPT (175B) 67.098924% Imported 2026-05-27
29 LLaMA (7B) 66.909449% Imported 2026-05-27
30 Luminous Extended (30B) 66.481763% Imported 2026-05-27
31 gpt-3.5-turbo-0301 66.304398% GPT-3.5 Turbo (openai-gpt-3.5-turbo) Imported 2026-05-27
32 BLOOM (176B) 66.205187% Imported 2026-05-27
33 Cohere xlarge v20220609 (52.4B) 65.013315% Imported 2026-05-27
34 Vicuna v1.3 (7B) 64.318002% Imported 2026-05-27
35 RedPajama-INCITE-Instruct (7B) 63.768167% Imported 2026-05-27
36 RedPajama-INCITE-Instruct-v1 (3B) 63.757678% Imported 2026-05-27
37 OPT (66B) 63.755182% Imported 2026-05-27
38 TNLG v2 (6.7B) 63.091322% Imported 2026-05-27
39 gpt-3.5-turbo-0613 62.507836% GPT-3.5 Turbo (openai-gpt-3.5-turbo) Imported 2026-05-27
40 Cohere large v20220720 (13.1B) 62.46782% Imported 2026-05-27
41 Falcon-Instruct (40B) 62.466006% Imported 2026-05-27
42 J1-Large v1 (7.5B) 62.33777% Imported 2026-05-27
43 Falcon (7B) 62.09479% Imported 2026-05-27
44 RedPajama-INCITE-Base (7B) 61.733571% Imported 2026-05-27
45 Cohere medium v20221108 (6.1B) 61.042163% Imported 2026-05-27
46 Luminous Base (13B) 60.490139% Imported 2026-05-27
47 curie (6.7B) 60.440873% Imported 2026-05-27
48 GPT-NeoX (20B) 59.891555% Imported 2026-05-27
49 Pythia (12B) 59.626738% Imported 2026-05-27
50 text-curie-001 58.19233% Imported 2026-05-27
51 Cohere medium v20220720 (6.1B) 55.903339% Imported 2026-05-27
52 RedPajama-INCITE-Base-v1 (3B) 55.511204% Imported 2026-05-27
53 GPT-J (6B) 54.469197% Imported 2026-05-27
54 Pythia (6.9B) 52.838757% Imported 2026-05-27
55 InstructPalmyra (30B) 49.645549% Imported 2026-05-27
56 babbage (1.3B) 49.134125% Imported 2026-05-27
57 Falcon-Instruct (7B) 47.631318% Imported 2026-05-27
58 text-babbage-001 42.935871% Imported 2026-05-27
59 Alpaca (7B) 39.598078% Imported 2026-05-27
60 ada (350M) 32.610438% Imported 2026-05-27
61 Cohere small v20220720 (410M) 29.366608% Imported 2026-05-27
62 YaLM (100B) 25.212195% Imported 2026-05-27
63 text-ada-001 23.810811% Imported 2026-05-27
64 T0pp (11B) 15.117621% Imported 2026-05-27
65 T5 (11B) 8.557844% Imported 2026-05-27
66 UL2 (20B) 8.275131% Imported 2026-05-27
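To work with the table programmatically, the rows can be parsed and the F1 ordering checked. This is a minimal sketch, assuming each row sits on a single line in the order rank, subject, F1 percentage, optional model match, provenance, and ISO date; the `Row` type, the regular expression, and the helper names are hypothetical and not part of any published tooling.

```python
import re
from typing import NamedTuple, Optional


class Row(NamedTuple):
    rank: int
    subject: str
    f1_percent: float
    model_match: Optional[str]
    provenance: str
    sampled: str


# Assumed layout: "<rank> <subject> <f1>% [<model match>] <provenance> <YYYY-MM-DD>"
ROW_RE = re.compile(
    r"^(?P<rank>\d+)\s+(?P<subject>.+?)\s+(?P<f1>\d+(?:\.\d+)?)%"
    r"(?:\s+(?P<match>.+?))?\s+(?P<prov>\S+)\s+(?P<date>\d{4}-\d{2}-\d{2})$"
)


def parse_row(line: str) -> Row:
    """Parse one table row; raise ValueError if it does not fit the layout."""
    m = ROW_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized row: {line!r}")
    return Row(
        rank=int(m["rank"]),
        subject=m["subject"],
        f1_percent=float(m["f1"]),
        model_match=m["match"],  # None when the column is empty
        provenance=m["prov"],
        sampled=m["date"],
    )


def is_ranked_by_f1(rows: list[Row]) -> bool:
    """True when ranks run 1..n and F1 never increases down the table."""
    ranks_ok = [r.rank for r in rows] == list(range(1, len(rows) + 1))
    f1_ok = all(a.f1_percent >= b.f1_percent for a, b in zip(rows, rows[1:]))
    return ranks_ok and f1_ok


sample = [
    "1 Llama 2 (70B) 76.993143% Imported 2026-05-27",
    "2 LLaMA (65B) 75.486594% Imported 2026-05-27",
    "3 LLaMA (30B) 75.247492% Imported 2026-05-27",
]
rows = [parse_row(line) for line in sample]
print(is_ranked_by_f1(rows))
```

The lazy `subject` group stops at the first number followed by `%`, so subjects that contain digits, such as "Llama 2 (70B)", are kept whole.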