HealthBench

HealthBench evaluates clinical, biomedical, medical-exam, coding, or healthcare-document reasoning.

Rows: 5
Primary metric: mean_score
Sampled: 2026-05-27

Metadata

Metrics

Mean score, Minimum score, Maximum score, Standard deviation (lower is better), Runs

Latest Results

Rows are parsed from the table in the arXiv source of the OpenAI HealthBench paper that reports variability in the overall HealthBench score across 16 runs. Mean score is the primary metric; the minimum, maximum, standard deviation, and run count are preserved alongside it.
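As an illustration of how the five listed metrics relate to a set of per-run scores, here is a minimal Python sketch. The function name `summarize_runs` and the example scores are hypothetical and are not taken from the paper. The sketch also assumes a population standard deviation; the source table may use the sample form instead.

```python
from statistics import mean, pstdev


def summarize_runs(scores):
    """Reduce per-run scores to the five metrics listed under Metrics.

    Assumption: population standard deviation (pstdev). The source table
    may use the sample form (statistics.stdev) instead.
    """
    if not scores:
        raise ValueError("at least one run score is required")
    return {
        "mean_score": mean(scores),
        "min_score": min(scores),
        "max_score": max(scores),
        "std_dev": pstdev(scores),
        "runs": len(scores),
    }


# Hypothetical scores for illustration only; these are NOT HealthBench results.
example_scores = [0.50, 0.52, 0.48, 0.50]
summary = summarize_runs(example_scores)
```

With real data, the list would hold the 16 overall scores for one model, one per run.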

Rank  Subject            Mean score  Model Match                           Provenance  Sampled
1     o3                 0.5990      o3 (openai-o3)                        Imported    2026-05-27
2     GPT-4.1            0.4778      GPT-4.1 (openai-gpt-4.1)              Imported    2026-05-27
3     o1                 0.4200      o1 (openai-o1)                        Imported    2026-05-27
4     GPT-4o (Aug 2024)  0.3233      GPT-4o (openai-gpt-4o)                Imported    2026-05-27
5     GPT-3.5 Turbo      0.1554      GPT-3.5 Turbo (openai-gpt-3.5-turbo)  Imported    2026-05-27
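
The ranking above follows from sorting rows by mean score in descending order. A minimal Python sketch of that ordering is shown below, using the subject names and mean scores from the table. The function name `rank_by_mean_score` and the tuple layout are illustrative assumptions, not the leaderboard's actual implementation.

```python
# (subject, mean_score) pairs copied from the table, deliberately unordered.
rows = [
    ("GPT-3.5 Turbo", 0.1554),
    ("o3", 0.5990),
    ("GPT-4o (Aug 2024)", 0.3233),
    ("o1", 0.4200),
    ("GPT-4.1", 0.4778),
]


def rank_by_mean_score(rows):
    """Return (rank, subject, mean_score) tuples, highest mean score first.

    Hypothetical helper: ties keep their input order because sorted() is stable.
    """
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    return [
        (rank, subject, score)
        for rank, (subject, score) in enumerate(ordered, start=1)
    ]


ranked = rank_by_mean_score(rows)
for rank, subject, score in ranked:
    print(f"{rank} {subject} {score:.4f}")
```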