MedHELM

Clinically grounded medical evaluation suite built on HELM, covering medical reasoning, safety, fairness, robustness, and specialty-specific tasks.

9 rows
Primary metric: mean_win_rate
Sampled: 2026-05-27

Metadata

Metrics

Mean Win Rate, Task Score Average
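
For readers unfamiliar with the primary metric, here is a minimal sketch of how a mean win rate is commonly computed in HELM-style leaderboards: for each task, take the fraction of other models that a given model outscores, then average those fractions over tasks. The function name `mean_win_rates`, the input layout, and the example scores are all hypothetical and are not taken from MedHELM. Ties are counted as non-wins here, which is an assumption; HELM's own tie handling may differ.

```python
from statistics import mean


def mean_win_rates(scores):
    """Compute a mean win rate per model.

    scores: {model_name: {task_name: score}}, where higher scores are better.
    Returns {model_name: mean win rate across the tasks it was scored on}.
    Assumption: a tie does not count as a win.
    """
    models = list(scores)
    result = {}
    for model in models:
        per_task = []
        for task, score in scores[model].items():
            # Scores of every other model that was also evaluated on this task.
            rivals = [
                scores[other][task]
                for other in models
                if other != model and task in scores[other]
            ]
            if not rivals:
                continue  # no head-to-head comparison is possible
            wins = sum(1 for rival in rivals if score > rival)
            per_task.append(wins / len(rivals))
        result[model] = mean(per_task) if per_task else None
    return result


# Hypothetical scores for illustration only (not MedHELM data).
example = {
    "model_a": {"task_1": 0.9, "task_2": 0.5},
    "model_b": {"task_1": 0.7, "task_2": 0.6},
    "model_c": {"task_1": 0.8, "task_2": 0.4},
}
rates = mean_win_rates(example)
for name, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {rate:.2f}")
```

Because each per-task win rate is a fraction of the other models beaten, the metric is relative: a model's value depends on which other models are on the leaderboard, not only on its own task scores.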

Latest Results

Rows are parsed from the public MedHELM v2.0.0 Accuracy group JSON generated by HELM. Task-level scores are retained in each row's metadata.

| Rank | Subject | Mean Win Rate | Model Match | Provenance | Sampled |
| --- | --- | --- | --- | --- | --- |
| 1 | DeepSeek R1 | 0.6625 | R1 (deepseek-r1) | Imported | 2026-05-27 |
| 2 | o3-mini (2025-01-31) | 0.6410714285714286 | o3-mini (openai-o3-mini) | Imported | 2026-05-27 |
| 3 | Claude 3.7 Sonnet (20250219) | 0.6357142857142857 | Claude 3.7 Sonnet (anthropic-claude-3.7-sonnet) | Imported | 2026-05-27 |
| 4 | Claude 3.5 Sonnet (20241022) | 0.6339285714285714 | Claude 3.5 Sonnet (anthropic-claude-3.5-sonnet) | Imported | 2026-05-27 |
| 5 | GPT-4o (2024-05-13) | 0.5696428571428571 | GPT-4o (openai-gpt-4o) | Imported | 2026-05-27 |
| 6 | Gemini 2.0 Flash | 0.41964285714285715 | Gemini 2.0 Flash (google-gemini-2.0-flash) | Imported | 2026-05-27 |
| 7 | GPT-4o mini (2024-07-18) | 0.39285714285714285 | GPT-4o-mini (openai-gpt-4o-mini) | Imported | 2026-05-27 |
| 8 | Llama 3.3 Instruct (70B) | 0.30357142857142855 | | Imported | 2026-05-27 |
| 9 | Gemini 1.5 Pro (001) | 0.24107142857142858 | | Imported | 2026-05-27 |