SimpleQA

SimpleQA: Measures how accurately a language model answers short, fact-seeking questions.

23 rows
Primary metric: simpleqa_accuracy
Sampled: 2026-05-27

Latest Results

Rows are imported from the benchmark results table in the OpenAI simple-evals README. According to the repository's deprecation notice, the table will not be updated with new model results after July 2025.

| Rank | Subject | SimpleQA | Model Match | Provenance | Sampled |
| --- | --- | --- | --- | --- | --- |
| 1 | gpt-4.5-preview-2025-02-27 | 62.5% | GPT-4.5 (openai-gpt-4.5-preview) | Imported | 2026-05-27 |
| 2 | o3 [^9] [^10] | 49.4% | | Imported | 2026-05-27 |
| 3 | o3-low [^10] | 49.4% | | Imported | 2026-05-27 |
| 4 | o3-high [^10] | 48.6% | | Imported | 2026-05-27 |
| 5 | o1 | 42.6% | o1 (openai-o1) | Imported | 2026-05-27 |
| 6 | o1-preview | 42.4% | o1-preview (openai-o1-preview) | Imported | 2026-05-27 |
| 7 | gpt-4.1-2025-04-14 | 41.6% | GPT-4.1 (openai-gpt-4.1) | Imported | 2026-05-27 |
| 8 | gpt-4o-2024-08-06 | 40.1% | GPT-4o (openai-gpt-4o) | Imported | 2026-05-27 |
| 9 | gpt-4o-2024-05-13 | 39% | GPT-4o (openai-gpt-4o) | Imported | 2026-05-27 |
| 10 | gpt-4o-2024-11-20 | 38.8% | GPT-4o (openai-gpt-4o) | Imported | 2026-05-27 |
| 11 | [Claude 3.5 Sonnet](https://www.anthropic.com/news/claude-3-5-sonnet) | 28.9% | Claude 3.5 Sonnet (anthropic-claude-3.5-sonnet) | Imported | 2026-05-27 |
| 12 | gpt-4-turbo-2024-04-09 | 24.2% | GPT-4 Turbo (openai-gpt-4-turbo) | Imported | 2026-05-27 |
| 13 | [Claude 3 Opus](https://www.anthropic.com/news/claude-3-family) | 23.5% | | Imported | 2026-05-27 |
| 14 | o4-mini [^9] [^10] | 20.2% | | Imported | 2026-05-27 |
| 15 | o4-mini-low [^10] | 20.2% | | Imported | 2026-05-27 |
| 16 | o4-mini-high [^9] [^10] | 19.3% | | Imported | 2026-05-27 |
| 17 | gpt-4.1-mini-2025-04-14 | 16.8% | GPT-4.1 Mini (openai-gpt-4.1-mini) | Imported | 2026-05-27 |
| 18 | o3-mini-high | 13.8% | o3 Mini High (openai-o3-mini-high) | Imported | 2026-05-27 |
| 19 | o3-mini | 13.4% | o3-mini (openai-o3-mini) | Imported | 2026-05-27 |
| 20 | o3-mini-low | 13% | | Imported | 2026-05-27 |
| 21 | gpt-4o-mini-2024-07-18 | 9.5% | GPT-4o-mini (openai-gpt-4o-mini) | Imported | 2026-05-27 |
| 22 | o1-mini | 7.6% | | Imported | 2026-05-27 |
| 23 | gpt-4.1-nano-2025-04-14 | 7.6% | GPT-4.1 Nano (openai-gpt-4.1-nano) | Imported | 2026-05-27 |
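
The ranking above gives tied scores consecutive ranks (for example, o3 and o3-low both score 49.4% but hold ranks 2 and 3). One way such an ordering could be produced is a stable descending sort on the score. The sketch below is illustrative only: the `rank_rows` helper and the tie-breaking rule (keep input order) are assumptions, not the method this page or the simple-evals repository is known to use.

```python
# Illustrative subset of the rows listed above: (subject, SimpleQA accuracy in percent).
# The input order here is arbitrary except that "o3" precedes "o3-low".
rows = [
    ("o1", 42.6),
    ("gpt-4.5-preview-2025-02-27", 62.5),
    ("o3", 49.4),
    ("o3-low", 49.4),
    ("o3-high", 48.6),
]


def rank_rows(rows):
    """Return (rank, subject, score) tuples, highest score first.

    Hypothetical helper. Python's sort is stable, so tied scores keep
    their input order and still receive consecutive ranks. That
    tie-breaking rule is an assumption, not something the source states.
    """
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    return [
        (rank, subject, score)
        for rank, (subject, score) in enumerate(ordered, start=1)
    ]


ranked = rank_rows(rows)
for rank, subject, score in ranked:
    print(f"{rank} {subject} {score}%")
```

A competition-style ranking, where tied rows share a rank, would be an equally reasonable choice; the consecutive form is used here only because it matches the numbering shown in the results.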