BoolQ

BoolQ: A yes/no question-answering benchmark. Each item pairs a short passage with a naturally occurring question, and the model is scored on whether its yes/no answer matches the reference.

Rows: 67
Primary metric: exact_match
Sampled: 2026-05-27

Metadata

Metrics

Exact match, ECE (10-bin) (lower is better), Exact match (Robustness), Exact match (Fairness), Denoised inference time (s) (lower is better), # eval
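
As a rough illustration of the first two metrics listed above, the following sketch computes exact match and a 10-bin expected calibration error (ECE). It is not the HELM implementation: the function names, the lowercase/strip normalization, and the equal-width binning are assumptions made for this example, and the source does not state which binning scheme its ECE figure uses.

```python
from typing import Sequence


def exact_match(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions equal to the reference.

    Assumption: answers are compared after stripping whitespace and lowercasing.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return hits / len(predictions)


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """ECE over equal-width confidence bins (an assumed binning scheme).

    Each bin contributes |accuracy - mean confidence| weighted by its share
    of the examples. Lower is better; 0 means perfectly calibrated.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0
    conf_sum = [0.0] * n_bins
    hit_sum = [0] * n_bins
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        conf_sum[idx] += c
        hit_sum[idx] += int(ok)
    # |hits/count - conf/count| * count/n simplifies to |hits - conf| / n per bin.
    return sum(abs(h - c) for h, c in zip(hit_sum, conf_sum)) / n
```

For example, `exact_match(["Yes", "no ", "yes"], ["yes", "no", "no"])` counts two of three answers as correct under the assumed normalization.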

Latest Results

Rows are ranked by Exact match from the aggregate HELM Classic boolq group table.

Rank Subject Exact match Model Match Provenance Sampled
1 Palmyra X (43B) 89.633333% Imported 2026-05-27
2 Llama 2 (70B) 88.6% Imported 2026-05-27
3 text-davinci-003 88.133333% Imported 2026-05-27
4 text-davinci-002 87.7% Imported 2026-05-27
5 Mistral v0.1 (7B) 87.4% Imported 2026-05-27
6 LLaMA (65B) 87.1% Imported 2026-05-27
7 gpt-3.5-turbo-0613 87% GPT-3.5 Turbo (openai-gpt-3.5-turbo) Imported 2026-05-27
8 LLaMA (30B) 86.1% Imported 2026-05-27
9 Cohere Command beta (52.4B) 85.633333% Imported 2026-05-27
10 MPT-Instruct (30B) 85% Imported 2026-05-27
11 Falcon-Instruct (40B) 82.9% Imported 2026-05-27
12 Jurassic-2 Jumbo (178B) 82.9% Imported 2026-05-27
13 Jurassic-2 Grande (17B) 82.6% Imported 2026-05-27
14 Falcon (40B) 81.9% Imported 2026-05-27
15 Anthropic-LM v4-s3 (52B) 81.533333% Imported 2026-05-27
16 J1-Grande v2 beta (17B) 81.233333% Imported 2026-05-27
17 Llama 2 (13B) 81.1% Imported 2026-05-27
18 TNLG v2 (530B) 80.933333% Imported 2026-05-27
19 Vicuna v1.3 (13B) 80.8% Imported 2026-05-27
20 Cohere Command beta (6.1B) 79.8% Imported 2026-05-27
21 OPT (175B) 79.3% Imported 2026-05-27
22 GLM (130B) 78.366667% Imported 2026-05-27
23 Alpaca (7B) 77.8% Imported 2026-05-27
24 J1-Jumbo v1 (178B) 77.566667% Imported 2026-05-27
25 Luminous Supreme (70B) 77.5% Imported 2026-05-27
26 Luminous Extended (30B) 76.666667% Imported 2026-05-27
27 Llama 2 (7B) 76.2% Imported 2026-05-27
28 Cohere xlarge v20221108 (52.4B) 76.166667% Imported 2026-05-27
29 T5 (11B) 76.1% Imported 2026-05-27
30 OPT (66B) 76.033333% Imported 2026-05-27
31 Vicuna v1.3 (7B) 76% Imported 2026-05-27
32 LLaMA (7B) 75.6% Imported 2026-05-27
33 Falcon (7B) 75.3% Imported 2026-05-27
34 InstructPalmyra (30B) 75.133333% Imported 2026-05-27
35 UL2 (20B) 74.566667% Imported 2026-05-27
36 Jurassic-2 Large (7.5B) 74.233333% Imported 2026-05-27
37 gpt-3.5-turbo-0301 74% GPT-3.5 Turbo (openai-gpt-3.5-turbo) Imported 2026-05-27
38 Cohere large v20220720 (13.1B) 72.533333% Imported 2026-05-27
39 davinci (175B) 72.233333% Imported 2026-05-27
40 J1-Grande v1 (17B) 72.166667% Imported 2026-05-27
41 Falcon-Instruct (7B) 72% Imported 2026-05-27
42 Luminous Base (13B) 71.866667% Imported 2026-05-27
43 Cohere xlarge v20220609 (52.4B) 71.766667% Imported 2026-05-27
44 LLaMA (13B) 71.4% Imported 2026-05-27
45 RedPajama-INCITE-Base (7B) 71.3% Imported 2026-05-27
46 RedPajama-INCITE-Instruct (7B) 70.5% Imported 2026-05-27
47 BLOOM (176B) 70.4% Imported 2026-05-27
48 MPT (30B) 70.4% Imported 2026-05-27
49 Cohere medium v20221108 (6.1B) 70% Imported 2026-05-27
50 TNLG v2 (6.7B) 69.833333% Imported 2026-05-27
51 RedPajama-INCITE-Base-v1 (3B) 68.5% Imported 2026-05-27
52 J1-Large v1 (7.5B) 68.333333% Imported 2026-05-27
53 GPT-NeoX (20B) 68.266667% Imported 2026-05-27
54 RedPajama-INCITE-Instruct-v1 (3B) 67.7% Imported 2026-05-27
55 Pythia (12B) 66.2% Imported 2026-05-27
56 Cohere medium v20220720 (6.1B) 65.9% Imported 2026-05-27
57 curie (6.7B) 65.633333% Imported 2026-05-27
58 GPT-J (6B) 64.866667% Imported 2026-05-27
59 YaLM (100B) 63.4% Imported 2026-05-27
60 Pythia (6.9B) 63.1% Imported 2026-05-27
61 text-curie-001 62.033333% Imported 2026-05-27
62 ada (350M) 58.1% Imported 2026-05-27
63 babbage (1.3B) 57.433333% Imported 2026-05-27
64 text-ada-001 46.4% Imported 2026-05-27
65 Cohere small v20220720 (410M) 45.733333% Imported 2026-05-27
66 text-babbage-001 45.1% Imported 2026-05-27
67 T0pp (11B) 0% Imported 2026-05-27
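
The ranking in the table above can be reproduced with a simple descending sort on the Exact match column. The sketch below uses four rows from the table; the function name `rank_rows` is hypothetical. It assigns consecutive ranks even when scores tie, which matches how the two 82.9% rows receive ranks 11 and 12. How the source orders tied rows is not stated, so keeping the input order for ties is an assumption.

```python
from typing import List, Tuple


def rank_rows(rows: List[Tuple[str, float]]) -> List[Tuple[int, str, float]]:
    """Sort (subject, exact_match_percent) rows by score, highest first.

    Python's sort is stable, so tied rows keep their input order
    (an assumed tie-breaking rule, not one confirmed by the source).
    """
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    return [(rank, name, score) for rank, (name, score) in enumerate(ordered, start=1)]


# A few rows taken from the table above, deliberately out of order.
sample = [
    ("Llama 2 (70B)", 88.6),
    ("Falcon-Instruct (40B)", 82.9),
    ("Palmyra X (43B)", 89.633333),
    ("Jurassic-2 Jumbo (178B)", 82.9),
]

for rank, name, score in rank_rows(sample):
    print(f"{rank} {name} {score}%")
```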