MedSafe-Dx

Medical diagnostic safety benchmark measuring harm-weighted safety pass rate, coverage, diagnostic recall, and over-escalation behavior.

11 rows
Primary metric: safety_pass_rate
Sampled: 2026-05-27

Metadata

Metrics

Safety Pass Rate, Coverage Rate, Relative Harm Reduction, Expected Harm (lower is better), Top-1 Recall, Top-3 Recall, Over-Escalation Rate (lower is better)

Latest Results

Rows are parsed from the 250-case eval JSON files in the public MedSafe-Dx GitHub repository. The benchmark scores medical diagnostic safety and effectiveness under a harm-weighted policy.
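Because every score comes from the same 250-case set, a percentage can be turned back into an implied case count. The sketch below assumes Safety Pass Rate is the plain percentage of cases that passed; the source does not spell out the formula, and the function names are hypothetical.

```python
# Assumption: Safety Pass Rate = 100 * passed_cases / total_cases.
# The source only states that each eval file covers 250 cases.
N_CASES = 250


def passed_cases(safety_pass_rate: float, n_cases: int = N_CASES) -> int:
    """Convert a percentage score into the implied number of passing cases."""
    return round(safety_pass_rate * n_cases / 100)


def failed_cases(safety_pass_rate: float, n_cases: int = N_CASES) -> int:
    """Implied number of cases that did not pass."""
    return n_cases - passed_cases(safety_pass_rate, n_cases)


if __name__ == "__main__":
    for subject, rate in [("openai-gpt-5.2", 97.6),
                          ("google-gemini-3-pro-preview", 62.4)]:
        print(subject, passed_cases(rate), failed_cases(rate))
```

Under that assumption, one case is worth 0.4 percentage points, so small gaps between adjacent rows correspond to only a handful of cases.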

| Rank | Subject | Safety Pass Rate | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | openai-gpt-5.2 | 97.6 | GPT-5.2 (openai-gpt-5.2) | Imported | 2026-05-27 |
| 2 | anthropic-claude-haiku-4.5 | 95.6 | Claude Haiku 4.5 (anthropic-claude-haiku-4.5) | Imported | 2026-05-27 |
| 3 | openai-gpt-5-chat | 94 | GPT-5 Chat (openai-gpt-5-chat) | Imported | 2026-05-27 |
| 4 | openai-gpt-4o-mini | 90.4 | GPT-4o-mini (openai-gpt-4o-mini) | Imported | 2026-05-27 |
| 5 | openai-gpt-4.1 | 87.6 | GPT-4.1 (openai-gpt-4.1) | Imported | 2026-05-27 |
| 6 | anthropic-claude-sonnet-4.5 | 87.2 | Claude Sonnet 4.5 (anthropic-claude-sonnet-4.5) | Imported | 2026-05-27 |
| 7 | deepseek-deepseek-chat-v3-0324 | 85.2 | DeepSeek V3 0324 (deepseek-deepseek-chat-v3-0324) | Imported | 2026-05-27 |
| 8 | openai-gpt-oss-120b | 85.2 | gpt-oss-120b (openai-gpt-oss-120b) | Imported | 2026-05-27 |
| 9 | openai-gpt-5-mini | 84.8 | GPT-5 Mini (openai-gpt-5-mini) | Imported | 2026-05-27 |
| 10 | google-gemini-2.0-flash | 80 | Gemini 2.0 Flash (google-gemini-2.0-flash) | Imported | 2026-05-27 |
| 11 | google-gemini-3-pro-preview | 62.4 | Gemini 3 (google-gemini-3) | Imported | 2026-05-27 |
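Two subjects share a score of 85.2 but are listed at ranks 7 and 8, and the page does not state how ties are ordered. As a minimal sketch, the snippet below applies standard competition ranking (tied scores share a rank, the next rank is skipped) to the scores in the table. This is one possible convention, not necessarily the one the leaderboard uses, and `competition_rank` is a hypothetical helper.

```python
# Scores copied from the results table (Safety Pass Rate, in percent).
SCORES = {
    "openai-gpt-5.2": 97.6,
    "anthropic-claude-haiku-4.5": 95.6,
    "openai-gpt-5-chat": 94.0,
    "openai-gpt-4o-mini": 90.4,
    "openai-gpt-4.1": 87.6,
    "anthropic-claude-sonnet-4.5": 87.2,
    "deepseek-deepseek-chat-v3-0324": 85.2,
    "openai-gpt-oss-120b": 85.2,
    "openai-gpt-5-mini": 84.8,
    "google-gemini-2.0-flash": 80.0,
    "google-gemini-3-pro-preview": 62.4,
}


def competition_rank(scores: dict[str, float]) -> dict[str, int]:
    """Rank subjects by descending score; equal scores share the same rank."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    ranks: dict[str, int] = {}
    prev_score = None
    prev_rank = 0
    for position, (subject, score) in enumerate(ordered, start=1):
        if score != prev_score:
            # New score value: the rank jumps to the current position.
            prev_rank = position
            prev_score = score
        ranks[subject] = prev_rank
    return ranks


if __name__ == "__main__":
    for subject, rank in competition_rank(SCORES).items():
        print(rank, subject, SCORES[subject])
```

With this convention both 85.2 entries would sit at rank 7 and the next subject at rank 9, which is worth keeping in mind when comparing positions in the middle of the table.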