PhysicianBench

Long-horizon physician workflow benchmark grounded in clinical records, measuring checkpoint and end-to-end task success.

Rows: 12
Primary metric: pass_at_1
Sampled: 2026-05-27

Metadata

Metrics

Pass@1, Pass@1 SD (lower is better), Pass@3, Pass^3, Mean Turns (lower is better)

Latest Results

Rows parsed from the public PhysicianBench leaderboard. Metrics are averaged over three independent runs on 100 clinician-validated EHR tasks.
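As a rough illustration of how the listed metrics could be derived from three runs, the sketch below aggregates per-task outcomes. It rests on assumptions not confirmed by the leaderboard: Pass@3 is taken to mean "solved in at least one of the three runs", Pass^3 to mean "solved in all three runs", and the SD is the sample standard deviation of the per-run pass rates. The `summarize` function and the boolean-matrix layout are hypothetical.

```python
from statistics import mean, stdev


def summarize(runs):
    """Aggregate pass metrics from per-run task outcomes.

    `runs` is a list of runs; each run is a list of booleans, one per task
    (hypothetical layout, not the benchmark's actual result format).
    """
    n_tasks = len(runs[0])
    # Pass rate of each individual run, in percent.
    per_run = [100.0 * sum(run) / n_tasks for run in runs]
    # Regroup outcomes by task: one tuple of run results per task.
    per_task = list(zip(*runs))
    return {
        "pass_at_1": mean(per_run),
        # Sample SD across runs (assumption: the leaderboard may use another estimator).
        "pass_at_1_sd": stdev(per_run),
        # Assumed Pass@k: task counts if any run solved it.
        "pass_at_k": 100.0 * sum(any(t) for t in per_task) / n_tasks,
        # Assumed Pass^k: task counts only if every run solved it.
        "pass_hat_k": 100.0 * sum(all(t) for t in per_task) / n_tasks,
    }


# Toy data: 3 runs over 4 tasks (made up for illustration, not benchmark results).
runs = [
    [True, False, True, False],
    [True, False, False, False],
    [True, True, True, False],
]
summary = summarize(runs)
```

Under these definitions Pass^3 can never exceed Pass@1, and Pass@1 can never exceed Pass@3, which gives a quick sanity check on any parsed row that reports all three.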

| Rank | Subject | Pass@1 (mean +/- SD) | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT-5.5 | 46.3 +/- 1.2 | GPT-5.5 (openai-gpt-5.5) | Imported | 2026-05-27 |
| 2 | Claude Opus 4.6 | 31.7 +/- 2.3 | Claude Opus 4.6 (anthropic-claude-opus-4.6) | Imported | 2026-05-27 |
| 3 | Claude Opus 4.7 | 29.3 +/- 2.5 | Claude Opus 4.7 (anthropic-claude-opus-4.7) | Imported | 2026-05-27 |
| 4 | GPT-5.4 | 27.7 +/- 1.5 | GPT-5.4 (openai-gpt-5.4) | Imported | 2026-05-27 |
| 5 | Claude Sonnet 4.6 | 23.0 +/- 2.6 | Claude Sonnet 4.6 (anthropic-claude-sonnet-4.6) | Imported | 2026-05-27 |
| 6 | DeepSeek V4-Pro | 18.7 +/- 2.9 | DeepSeek V4 Pro (deepseek-deepseek-v4-pro) | Imported | 2026-05-27 |
| 7 | Kimi-K2.6 | 17.0 +/- 2.6 | MoonshotAI: Kimi K2.6 (moonshotai-kimi-k2.6) | Imported | 2026-05-27 |
| 8 | MiMo-v2.5-Pro | 16.7 +/- 4.0 | MiMo-V2.5-Pro (xiaomi-mimo-v2.5-pro) | Imported | 2026-05-27 |
| 9 | Qwen3.6-Plus | 13.7 +/- 4.0 | Qwen3.6 Plus (qwen-qwen3.6-plus) | Imported | 2026-05-27 |
| 10 | MiniMax M2.7 | 8.7 +/- 1.2 | MiniMax M2.7 (minimax-minimax-m2.7) | Imported | 2026-05-27 |
| 11 | Gemini Pro 3.1 | 6.0 +/- 1.0 | | Imported | 2026-05-27 |
| 12 | Grok-4.20 | 5.3 +/- 3.2 | Grok 4.20 (x-ai-grok-4.20) | Imported | 2026-05-27 |