MedAgentBench

Interactive EHR-agent benchmark with physician-written tasks over healthcare data and FHIR-style clinical workflows.

Rows: 12
Primary metric: overall_success_rate
Sampled: 2026-05-27

Metadata

Metrics

Overall SR, Query SR, Action SR (SR = success rate)

Latest Results

Rows parsed from Table 2 on the public MedAgentBench project page. Success rates are reported for 300 clinically relevant FHIR-environment tasks.

Rank  Subject               Overall SR  Model Match                                Provenance  Sampled
1     Claude 3.5 Sonnet v2  69.67%      -                                          Imported    2026-05-27
2     GPT-4o                64.00%      GPT-4o (openai-gpt-4o)                     Imported    2026-05-27
3     DeepSeek-V3           62.67%      DeepSeek V3 (deepseek-deepseek-chat)       Imported    2026-05-27
4     Gemini-1.5 Pro        62.00%      -                                          Imported    2026-05-27
5     GPT-4o-mini           56.33%      GPT-4o-mini (openai-gpt-4o-mini)           Imported    2026-05-27
6     o3-mini               51.67%      o3-mini (openai-o3-mini)                   Imported    2026-05-27
7     Qwen2.5               51.33%      -                                          Imported    2026-05-27
8     Llama 3.3             46.33%      -                                          Imported    2026-05-27
9     Gemini 2.0 Flash      38.33%      Gemini 2.0 Flash (google-gemini-2.0-flash) Imported    2026-05-27
10    Gemma2                19.33%      -                                          Imported    2026-05-27
11    Gemini 2.0 Pro        18.00%      -                                          Imported    2026-05-27
12    Mistral v0.3          4.00%       -                                          Imported    2026-05-27
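
Because the success rates are reported over 300 tasks, each percentage can be converted back to an implied number of solved tasks. The following is a minimal sketch of that arithmetic, assuming (this is not stated on the page) that Overall SR is simply solved tasks divided by 300 with every task weighted equally. The names `TOTAL_TASKS`, `overall_sr`, and `implied_successes` are illustrative and not part of MedAgentBench.

```python
# Assumption: Overall SR = solved_tasks / 300, with all tasks weighted equally.
TOTAL_TASKS = 300

# Overall SR values (in percent) copied from the results table.
overall_sr = {
    "Claude 3.5 Sonnet v2": 69.67,
    "GPT-4o": 64.00,
    "DeepSeek-V3": 62.67,
    "Gemini-1.5 Pro": 62.00,
    "GPT-4o-mini": 56.33,
    "o3-mini": 51.67,
    "Qwen2.5": 51.33,
    "Llama 3.3": 46.33,
    "Gemini 2.0 Flash": 38.33,
    "Gemma2": 19.33,
    "Gemini 2.0 Pro": 18.00,
    "Mistral v0.3": 4.00,
}


def implied_successes(rate_percent, total=TOTAL_TASKS):
    """Return the whole number of solved tasks closest to a percentage rate."""
    return round(rate_percent / 100 * total)


def round_trips(rate_percent, total=TOTAL_TASKS):
    """Check that the implied count reproduces the rate at two decimals."""
    solved = implied_successes(rate_percent, total)
    return round(solved / total * 100, 2) == rate_percent


for model, rate in overall_sr.items():
    print(f"{model}: {implied_successes(rate)}/{TOTAL_TASKS} tasks")
```

Under that assumption, 69.67% corresponds to 209 of 300 tasks and 4.00% to 12 of 300, and every rate in the table matches a whole-number count when rounded to two decimals.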