HELM MedQA
HELM MedQA: Evaluates clinical, biomedical, medical-exam, coding, or healthcare-document reasoning.
13 rows
Primary metric: exact_match
Sampled: 2026-05-28
Metadata
Metrics
Exact match, Observed inference time (s) (lower is better), # eval
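As a rough illustration of the primary metric, exact match can be read as the fraction of questions where the model's answer equals the reference answer. The sketch below is not HELM's implementation: the function name `exact_match` and the normalization (trimming whitespace, ignoring case) are assumptions made for this example, and the answer letters are invented sample data.

```python
def exact_match(predictions, references):
    """Fraction of predictions that equal their reference.

    Assumption: answers are compared after trimming whitespace and
    case-folding; the benchmark's own normalization may differ.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("need at least one example")
    hits = sum(
        p.strip().casefold() == r.strip().casefold()
        for p, r in zip(predictions, references)
    )
    return hits / len(references)


# Hypothetical multiple-choice answers: three of the four match.
score = exact_match(["A", "c ", "B", "D"], ["A", "C", "D", "D"])
print(score)  # 0.75
```

Under this reading, a score such as 0.968191 would mean that roughly 96.8% of the evaluated questions were answered with the reference choice.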
| Rank | Subject | Exact match | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT-5 (2025-08-07) | 0.968191 | GPT-5 (openai-gpt-5) | Imported | 2026-05-28 |
| 2 | GPT-5 mini (2025-08-07) | 0.956262 | GPT-5 Mini (openai-gpt-5-mini) | Imported | 2026-05-28 |
| 3 | o4-mini (2025-04-16) | 0.948310 | o4 Mini (openai-o4-mini) | Imported | 2026-05-28 |
| 4 | Gemini 2.5 Pro (05-06 preview) | 0.934394 | Gemini 2.5 Pro (google-gemini-2.5-pro) | Imported | 2026-05-28 |
| 5 | o3-mini (2025-01-31) | 0.920477 | o3-mini (openai-o3-mini) | Imported | 2026-05-28 |
| 6 | GPT-4o (2024-05-13) | 0.876740 | GPT-4o (openai-gpt-4o) | Imported | 2026-05-28 |
| 7 | Claude 3.5 Sonnet (20241022) | 0.864811 | Claude 3.5 Sonnet (anthropic-claude-3.5-sonnet) | Imported | 2026-05-28 |
| 8 | Claude 3.7 Sonnet (20250219) | 0.856859 | Claude 3.7 Sonnet (anthropic-claude-3.7-sonnet) | Imported | 2026-05-28 |
| 9 | DeepSeek R1 | 0.856859 | R1 (deepseek-r1) | Imported | 2026-05-28 |
| 10 | Gemini 2.0 Flash | 0.848907 | Gemini 2.0 Flash (google-gemini-2.0-flash) | Imported | 2026-05-28 |
| 11 | Llama 3.3 Instruct (70B) | 0.801193 | — | Imported | 2026-05-28 |
| 12 | Gemini 1.5 Pro (001) | 0.769384 | — | Imported | 2026-05-28 |
| 13 | GPT-4o mini (2024-07-18) | 0.749503 | GPT-4o-mini (openai-gpt-4o-mini) | Imported | 2026-05-28 |