PhysicianBench
Long-horizon physician workflow benchmark grounded in clinical records, measuring checkpoint and end-to-end task success.
Rows: 12
Primary metric: pass_at_1
Sampled: 2026-05-27
Metadata
Metrics: Pass@1, Pass@1 SD (lower is better), Pass@3, Pass^3, Mean Turns (lower is better)
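
The page lists these metrics without defining them. As a rough sketch only, assume three independent trials per task, with Pass@1 as the per-run success rate averaged over runs, Pass@1 SD as the sample standard deviation of those per-run rates, Pass@3 as the share of tasks solved in at least one of three trials, and Pass^3 as the share solved in all three. The `trials` data and every function name below are hypothetical and not taken from the benchmark.

```python
from statistics import mean, stdev

# Hypothetical outcomes: task id -> success flag for each of three trials.
trials = {
    "task_a": [True, True, True],
    "task_b": [True, False, False],
    "task_c": [False, False, False],
    "task_d": [False, True, False],
}


def per_run_pass_at_1(trials):
    """Percent of tasks solved in each trial index, treated as a separate run."""
    n_tasks = len(trials)
    n_runs = len(next(iter(trials.values())))
    return [
        100 * sum(outcomes[i] for outcomes in trials.values()) / n_tasks
        for i in range(n_runs)
    ]


def pass_at_k(trials):
    """Percent of tasks solved in at least one trial (assumed Pass@k)."""
    return 100 * sum(any(o) for o in trials.values()) / len(trials)


def pass_hat_k(trials):
    """Percent of tasks solved in every trial (assumed Pass^k)."""
    return 100 * sum(all(o) for o in trials.values()) / len(trials)


runs = per_run_pass_at_1(trials)
pass_1_mean = mean(runs)   # assumed Pass@1
pass_1_sd = stdev(runs)    # assumed Pass@1 SD (sample SD across runs)
print(f"Pass@1 {pass_1_mean:.1f} +/- {pass_1_sd:.1f}")
print(f"Pass@3 {pass_at_k(trials):.1f}  Pass^3 {pass_hat_k(trials):.1f}")
```

Under these assumed definitions, Pass^3 can never exceed Pass@1, and Pass@1 can never exceed Pass@3, so a wide gap between Pass@3 and Pass^3 would indicate inconsistent behaviour across trials.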
| Rank | Subject | Pass@1 (+/- SD) | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT-5.5 | 46.3 +/- 1.2 | GPT-5.5 (`openai-gpt-5.5`) | Imported | 2026-05-27 |
| 2 | Claude Opus 4.6 | 31.7 +/- 2.3 | Claude Opus 4.6 (`anthropic-claude-opus-4.6`) | Imported | 2026-05-27 |
| 3 | Claude Opus 4.7 | 29.3 +/- 2.5 | Claude Opus 4.7 (`anthropic-claude-opus-4.7`) | Imported | 2026-05-27 |
| 4 | GPT-5.4 | 27.7 +/- 1.5 | GPT-5.4 (`openai-gpt-5.4`) | Imported | 2026-05-27 |
| 5 | Claude Sonnet 4.6 | 23.0 +/- 2.6 | Claude Sonnet 4.6 (`anthropic-claude-sonnet-4.6`) | Imported | 2026-05-27 |
| 6 | DeepSeek V4-Pro | 18.7 +/- 2.9 | DeepSeek V4 Pro (`deepseek-deepseek-v4-pro`) | Imported | 2026-05-27 |
| 7 | Kimi-K2.6 | 17.0 +/- 2.6 | MoonshotAI: Kimi K2.6 (`moonshotai-kimi-k2.6`) | Imported | 2026-05-27 |
| 8 | MiMo-v2.5-Pro | 16.7 +/- 4.0 | MiMo-V2.5-Pro (`xiaomi-mimo-v2.5-pro`) | Imported | 2026-05-27 |
| 9 | Qwen3.6-Plus | 13.7 +/- 4.0 | Qwen3.6 Plus (`qwen-qwen3.6-plus`) | Imported | 2026-05-27 |
| 10 | MiniMax M2.7 | 8.7 +/- 1.2 | MiniMax M2.7 (`minimax-minimax-m2.7`) | Imported | 2026-05-27 |
| 11 | Gemini Pro 3.1 | 6.0 +/- 1.0 | — | Imported | 2026-05-27 |
| 12 | Grok-4.20 | 5.3 +/- 3.2 | Grok 4.20 (`x-ai-grok-4.20`) | Imported | 2026-05-27 |
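
Several adjacent ranks sit close together relative to their reported spread. The sketch below copies the Pass@1 cells from the table and flags neighbouring rows whose one-spread intervals overlap. It assumes the `+/-` figure is the Pass@1 SD named in the metrics list. The interval check is a rough heuristic rather than a significance test, and the helper names are hypothetical.

```python
# (rank, subject, Pass@1 cell) copied verbatim from the table.
rows = [
    (1, "GPT-5.5", "46.3 +/- 1.2"),
    (2, "Claude Opus 4.6", "31.7 +/- 2.3"),
    (3, "Claude Opus 4.7", "29.3 +/- 2.5"),
    (4, "GPT-5.4", "27.7 +/- 1.5"),
    (5, "Claude Sonnet 4.6", "23.0 +/- 2.6"),
    (6, "DeepSeek V4-Pro", "18.7 +/- 2.9"),
    (7, "Kimi-K2.6", "17.0 +/- 2.6"),
    (8, "MiMo-v2.5-Pro", "16.7 +/- 4.0"),
    (9, "Qwen3.6-Plus", "13.7 +/- 4.0"),
    (10, "MiniMax M2.7", "8.7 +/- 1.2"),
    (11, "Gemini Pro 3.1", "6.0 +/- 1.0"),
    (12, "Grok-4.20", "5.3 +/- 3.2"),
]


def parse_cell(cell):
    """Split a 'value +/- spread' cell into two floats."""
    value, spread = cell.split("+/-")
    return float(value), float(spread)


def overlapping_neighbours(rows):
    """Return (higher, lower) subject pairs whose intervals overlap."""
    pairs = []
    for (_, name_hi, cell_hi), (_, name_lo, cell_lo) in zip(rows, rows[1:]):
        v_hi, s_hi = parse_cell(cell_hi)
        v_lo, s_lo = parse_cell(cell_lo)
        # Overlap when the higher row's lower bound reaches the lower row's upper bound.
        if v_hi - s_hi <= v_lo + s_lo:
            pairs.append((name_hi, name_lo))
    return pairs


for higher, lower in overlapping_neighbours(rows):
    print(f"{higher} and {lower} overlap within the reported spread")
```

By this check the gap between ranks 1 and 2 is well outside the reported spread, while ranks 2 and 3 overlap, so their ordering should be read with caution.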