MedHELM
Clinically grounded medical evaluation suite built on HELM, covering medical reasoning, safety, fairness, robustness, and specialty-specific tasks.
Rows: 9
Primary metric: mean_win_rate
Sampled: 2026-05-27
Metadata
Metrics: Mean Win Rate, Task Score Average
| Rank | Subject | Mean Win Rate | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | DeepSeek R1 | 0.6625 | R1 deepseek-r1 | Imported | 2026-05-27 |
| 2 | o3-mini (2025-01-31) | 0.6410714285714286 | o3-mini openai-o3-mini | Imported | 2026-05-27 |
| 3 | Claude 3.7 Sonnet (20250219) | 0.6357142857142857 | Claude 3.7 Sonnet anthropic-claude-3.7-sonnet | Imported | 2026-05-27 |
| 4 | Claude 3.5 Sonnet (20241022) | 0.6339285714285714 | Claude 3.5 Sonnet anthropic-claude-3.5-sonnet | Imported | 2026-05-27 |
| 5 | GPT-4o (2024-05-13) | 0.5696428571428571 | GPT-4o openai-gpt-4o | Imported | 2026-05-27 |
| 6 | Gemini 2.0 Flash | 0.41964285714285715 | Gemini 2.0 Flash google-gemini-2.0-flash | Imported | 2026-05-27 |
| 7 | GPT-4o mini (2024-07-18) | 0.39285714285714285 | GPT-4o-mini openai-gpt-4o-mini | Imported | 2026-05-27 |
| 8 | Llama 3.3 Instruct (70B) | 0.30357142857142855 | — | Imported | 2026-05-27 |
| 9 | Gemini 1.5 Pro (001) | 0.24107142857142858 | — | Imported | 2026-05-27 |
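
The table ranks models by mean win rate. As a rough illustration of how a metric of this kind is commonly computed on HELM-style leaderboards (for each task, the fraction of other models a given model outperforms, averaged over tasks), here is a minimal Python sketch. The function name `mean_win_rate`, the input layout, the example scores, and the choice to count a tie as half a win are all assumptions made for illustration; none of this is taken from MedHELM's actual implementation or data.

```python
from typing import Dict


def mean_win_rate(scores: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Compute a per-model mean win rate.

    scores[model][task] is a task score where higher is better (assumed).
    For each task, a model's win rate is the share of rival models it beats;
    ties count as half a win (an assumption, not a documented rule).
    The mean win rate is the average of those per-task win rates.
    """
    models = list(scores)
    tasks = sorted({task for model in models for task in scores[model]})
    result: Dict[str, float] = {}

    for model in models:
        per_task = []
        for task in tasks:
            if task not in scores[model]:
                continue  # skip tasks this model was not evaluated on
            rivals = [r for r in models if r != model and task in scores[r]]
            if not rivals:
                continue  # no one to compare against on this task
            wins = 0.0
            for rival in rivals:
                if scores[model][task] > scores[rival][task]:
                    wins += 1.0
                elif scores[model][task] == scores[rival][task]:
                    wins += 0.5
            per_task.append(wins / len(rivals))
        result[model] = sum(per_task) / len(per_task) if per_task else float("nan")

    return result


# Hypothetical scores for three made-up models on two made-up tasks.
example_scores = {
    "model_a": {"task_1": 0.9, "task_2": 0.5},
    "model_b": {"task_1": 0.7, "task_2": 0.5},
    "model_c": {"task_1": 0.8, "task_2": 0.2},
}

if __name__ == "__main__":
    for name, rate in mean_win_rate(example_scores).items():
        print(f"{name}: {rate:.3f}")
```

Because the metric is relative, a model's mean win rate depends on which other models are in the comparison set, so values are only comparable within one leaderboard snapshot.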