BoolQ
BoolQ: A yes/no question-answering benchmark in which each question is paired with a passage and the model must answer "yes" or "no".
67 rows
Primary metric: exact_match
Sampled: 2026-05-27
Metadata
Metrics
Exact match, ECE (10-bin) (lower is better), Exact match (Robustness), Exact match (Fairness), Denoised inference time (s) (lower is better), # eval
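As a rough illustration of the first two metrics listed above, the sketch below computes exact match and a 10-bin expected calibration error (ECE) for a handful of yes/no predictions. The function names, the trim-and-lowercase normalization, and the equal-width binning are all assumptions made for this example; the harness that produced the imported scores may normalize answers or bin confidences differently (for example with equal-mass bins).

```python
def exact_match(predictions, references):
    """Fraction of predictions that equal their reference.

    Assumption: answers are compared after trimming whitespace and lowercasing.
    """
    if not predictions or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and equal in length")
    hits = sum(p.strip().lower() == r.strip().lower()
               for p, r in zip(predictions, references))
    return hits / len(predictions)


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins (an assumption for this sketch).

    Each bin contributes |accuracy - mean confidence|, weighted by the
    fraction of examples that fall into it. Lower is better.
    """
    if not confidences or len(confidences) != len(correct):
        raise ValueError("confidences and correct must be non-empty and equal in length")
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Hypothetical toy data, not taken from the leaderboard.
preds = ["Yes", "no ", "yes"]
refs = ["yes", "no", "no"]
em = exact_match(preds, refs)

confs = [0.9, 0.9, 0.6, 0.6]
is_correct = [True, False, True, True]
ece = expected_calibration_error(confs, is_correct)
```

On this toy data two of the three predictions match, and both occupied confidence bins are off by 0.4, so the ECE is 0.4. A perfectly calibrated model would score 0.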
| Rank | Subject | Exact match | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Palmyra X (43B) | 89.633333% | — | Imported | 2026-05-27 |
| 2 | Llama 2 (70B) | 88.6% | — | Imported | 2026-05-27 |
| 3 | text-davinci-003 | 88.133333% | — | Imported | 2026-05-27 |
| 4 | text-davinci-002 | 87.7% | — | Imported | 2026-05-27 |
| 5 | Mistral v0.1 (7B) | 87.4% | — | Imported | 2026-05-27 |
| 6 | LLaMA (65B) | 87.1% | — | Imported | 2026-05-27 |
| 7 | gpt-3.5-turbo-0613 | 87% | GPT-3.5 Turbo openai-gpt-3.5-turbo | Imported | 2026-05-27 |
| 8 | LLaMA (30B) | 86.1% | — | Imported | 2026-05-27 |
| 9 | Cohere Command beta (52.4B) | 85.633333% | — | Imported | 2026-05-27 |
| 10 | MPT-Instruct (30B) | 85% | — | Imported | 2026-05-27 |
| 11 | Falcon-Instruct (40B) | 82.9% | — | Imported | 2026-05-27 |
| 12 | Jurassic-2 Jumbo (178B) | 82.9% | — | Imported | 2026-05-27 |
| 13 | Jurassic-2 Grande (17B) | 82.6% | — | Imported | 2026-05-27 |
| 14 | Falcon (40B) | 81.9% | — | Imported | 2026-05-27 |
| 15 | Anthropic-LM v4-s3 (52B) | 81.533333% | — | Imported | 2026-05-27 |
| 16 | J1-Grande v2 beta (17B) | 81.233333% | — | Imported | 2026-05-27 |
| 17 | Llama 2 (13B) | 81.1% | — | Imported | 2026-05-27 |
| 18 | TNLG v2 (530B) | 80.933333% | — | Imported | 2026-05-27 |
| 19 | Vicuna v1.3 (13B) | 80.8% | — | Imported | 2026-05-27 |
| 20 | Cohere Command beta (6.1B) | 79.8% | — | Imported | 2026-05-27 |
| 21 | OPT (175B) | 79.3% | — | Imported | 2026-05-27 |
| 22 | GLM (130B) | 78.366667% | — | Imported | 2026-05-27 |
| 23 | Alpaca (7B) | 77.8% | — | Imported | 2026-05-27 |
| 24 | J1-Jumbo v1 (178B) | 77.566667% | — | Imported | 2026-05-27 |
| 25 | Luminous Supreme (70B) | 77.5% | — | Imported | 2026-05-27 |
| 26 | Luminous Extended (30B) | 76.666667% | — | Imported | 2026-05-27 |
| 27 | Llama 2 (7B) | 76.2% | — | Imported | 2026-05-27 |
| 28 | Cohere xlarge v20221108 (52.4B) | 76.166667% | — | Imported | 2026-05-27 |
| 29 | T5 (11B) | 76.1% | — | Imported | 2026-05-27 |
| 30 | OPT (66B) | 76.033333% | — | Imported | 2026-05-27 |
| 31 | Vicuna v1.3 (7B) | 76% | — | Imported | 2026-05-27 |
| 32 | LLaMA (7B) | 75.6% | — | Imported | 2026-05-27 |
| 33 | Falcon (7B) | 75.3% | — | Imported | 2026-05-27 |
| 34 | InstructPalmyra (30B) | 75.133333% | — | Imported | 2026-05-27 |
| 35 | UL2 (20B) | 74.566667% | — | Imported | 2026-05-27 |
| 36 | Jurassic-2 Large (7.5B) | 74.233333% | — | Imported | 2026-05-27 |
| 37 | gpt-3.5-turbo-0301 | 74% | GPT-3.5 Turbo openai-gpt-3.5-turbo | Imported | 2026-05-27 |
| 38 | Cohere large v20220720 (13.1B) | 72.533333% | — | Imported | 2026-05-27 |
| 39 | davinci (175B) | 72.233333% | — | Imported | 2026-05-27 |
| 40 | J1-Grande v1 (17B) | 72.166667% | — | Imported | 2026-05-27 |
| 41 | Falcon-Instruct (7B) | 72% | — | Imported | 2026-05-27 |
| 42 | Luminous Base (13B) | 71.866667% | — | Imported | 2026-05-27 |
| 43 | Cohere xlarge v20220609 (52.4B) | 71.766667% | — | Imported | 2026-05-27 |
| 44 | LLaMA (13B) | 71.4% | — | Imported | 2026-05-27 |
| 45 | RedPajama-INCITE-Base (7B) | 71.3% | — | Imported | 2026-05-27 |
| 46 | RedPajama-INCITE-Instruct (7B) | 70.5% | — | Imported | 2026-05-27 |
| 47 | BLOOM (176B) | 70.4% | — | Imported | 2026-05-27 |
| 48 | MPT (30B) | 70.4% | — | Imported | 2026-05-27 |
| 49 | Cohere medium v20221108 (6.1B) | 70% | — | Imported | 2026-05-27 |
| 50 | TNLG v2 (6.7B) | 69.833333% | — | Imported | 2026-05-27 |
| 51 | RedPajama-INCITE-Base-v1 (3B) | 68.5% | — | Imported | 2026-05-27 |
| 52 | J1-Large v1 (7.5B) | 68.333333% | — | Imported | 2026-05-27 |
| 53 | GPT-NeoX (20B) | 68.266667% | — | Imported | 2026-05-27 |
| 54 | RedPajama-INCITE-Instruct-v1 (3B) | 67.7% | — | Imported | 2026-05-27 |
| 55 | Pythia (12B) | 66.2% | — | Imported | 2026-05-27 |
| 56 | Cohere medium v20220720 (6.1B) | 65.9% | — | Imported | 2026-05-27 |
| 57 | curie (6.7B) | 65.633333% | — | Imported | 2026-05-27 |
| 58 | GPT-J (6B) | 64.866667% | — | Imported | 2026-05-27 |
| 59 | YaLM (100B) | 63.4% | — | Imported | 2026-05-27 |
| 60 | Pythia (6.9B) | 63.1% | — | Imported | 2026-05-27 |
| 61 | text-curie-001 | 62.033333% | — | Imported | 2026-05-27 |
| 62 | ada (350M) | 58.1% | — | Imported | 2026-05-27 |
| 63 | babbage (1.3B) | 57.433333% | — | Imported | 2026-05-27 |
| 64 | text-ada-001 | 46.4% | — | Imported | 2026-05-27 |
| 65 | Cohere small v20220720 (410M) | 45.733333% | — | Imported | 2026-05-27 |
| 66 | text-babbage-001 | 45.1% | — | Imported | 2026-05-27 |
| 67 | T0pp (11B) | 0% | — | Imported | 2026-05-27 |