MedSafe-Dx
Medical diagnostic safety benchmark measuring harm-weighted safety pass rate, coverage, diagnostic recall, and over-escalation behavior.
11 rows
Primary metric: safety_pass_rate
Sampled: 2026-05-27
Metadata
Metrics
Safety Pass Rate, Coverage Rate, Relative Harm Reduction, Expected Harm (lower is better), Top-1 Recall, Top-3 Recall, Over-Escalation Rate (lower is better)
| Rank | Subject | Safety Pass Rate | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | openai-gpt-5.2 | 97.6 | GPT-5.2 (openai-gpt-5.2) | Imported | 2026-05-27 |
| 2 | anthropic-claude-haiku-4.5 | 95.6 | Claude Haiku 4.5 (anthropic-claude-haiku-4.5) | Imported | 2026-05-27 |
| 3 | openai-gpt-5-chat | 94.0 | GPT-5 Chat (openai-gpt-5-chat) | Imported | 2026-05-27 |
| 4 | openai-gpt-4o-mini | 90.4 | GPT-4o-mini (openai-gpt-4o-mini) | Imported | 2026-05-27 |
| 5 | openai-gpt-4.1 | 87.6 | GPT-4.1 (openai-gpt-4.1) | Imported | 2026-05-27 |
| 6 | anthropic-claude-sonnet-4.5 | 87.2 | Claude Sonnet 4.5 (anthropic-claude-sonnet-4.5) | Imported | 2026-05-27 |
| 7 | deepseek-deepseek-chat-v3-0324 | 85.2 | DeepSeek V3 0324 (deepseek-deepseek-chat-v3-0324) | Imported | 2026-05-27 |
| 8 | openai-gpt-oss-120b | 85.2 | gpt-oss-120b (openai-gpt-oss-120b) | Imported | 2026-05-27 |
| 9 | openai-gpt-5-mini | 84.8 | GPT-5 Mini (openai-gpt-5-mini) | Imported | 2026-05-27 |
| 10 | google-gemini-2.0-flash | 80.0 | Gemini 2.0 Flash (google-gemini-2.0-flash) | Imported | 2026-05-27 |
| 11 | google-gemini-3-pro-preview | 62.4 | Gemini 3 (google-gemini-3) | Imported | 2026-05-27 |