HealthBench
HealthBench evaluates clinical, biomedical, medical-exam, coding, or healthcare-document reasoning.
Rows: 5
Primary metric: mean_score
Sampled: 2026-05-27
Metadata
Metrics
Mean score, Minimum score, Maximum score, Standard deviation (lower is better), Runs
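The metrics listed above are summary statistics over repeated evaluation runs. As a minimal sketch of how they could be derived, assuming each subject has a list of per-run scores: the `ScoreSummary` and `summarize_runs` names are hypothetical, the example scores are made up and not taken from this leaderboard, and the use of the sample standard deviation is an assumption, since the page does not say which variant it reports.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class ScoreSummary:
    """Hypothetical container mirroring the metric names listed above."""
    mean_score: float
    min_score: float
    max_score: float
    std_dev: float  # lower means more consistent runs
    runs: int


def summarize_runs(scores: list[float]) -> ScoreSummary:
    """Summarize per-run scores for one subject."""
    if not scores:
        raise ValueError("at least one run score is required")
    # Assumption: sample standard deviation; it is undefined for a single
    # run, so fall back to 0.0 in that case.
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return ScoreSummary(
        mean_score=mean(scores),
        min_score=min(scores),
        max_score=max(scores),
        std_dev=spread,
        runs=len(scores),
    )


# Illustrative per-run scores only; not leaderboard data.
example = summarize_runs([0.58, 0.60, 0.62])
print(example)
```

A subject with a single run would report a standard deviation of 0.0 under this sketch, so the Runs count is worth reading alongside the spread.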
| Rank | Subject | Mean score | Model match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | o3 | 0.5990 | o3 (openai-o3) | Imported | 2026-05-27 |
| 2 | GPT-4.1 | 0.4778 | GPT-4.1 (openai-gpt-4.1) | Imported | 2026-05-27 |
| 3 | o1 | 0.4200 | o1 (openai-o1) | Imported | 2026-05-27 |
| 4 | GPT-4o (Aug 2024) | 0.3233 | GPT-4o (openai-gpt-4o) | Imported | 2026-05-27 |
| 5 | GPT-3.5 Turbo | 0.1554 | GPT-3.5 Turbo (openai-gpt-3.5-turbo) | Imported | 2026-05-27 |