MedAgentBench
Interactive EHR-agent benchmark with physician-written tasks over healthcare data and FHIR-style clinical workflows.
Rows: 12
Primary metric: overall_success_rate
Sampled: 2026-05-27
Metadata
Metrics: Overall SR, Query SR, Action SR (SR = success rate)
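The page names these three metrics without defining them. The sketch below shows one plausible reading, assuming each metric is the percentage of tasks an agent completes successfully, with Query SR and Action SR restricted to tasks of that type. The `TaskResult` record, the `"query"`/`"action"` labels, and the function names are hypothetical and are not taken from the benchmark's own code.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class TaskResult:
    """Hypothetical per-task outcome record (not the benchmark's schema)."""
    task_id: str
    kind: str       # assumed labels: "query" or "action"
    success: bool


def success_rate(results: Iterable[TaskResult]) -> Optional[float]:
    """Percentage of successful tasks, or None when there are no tasks."""
    results = list(results)
    if not results:
        return None
    return 100.0 * sum(r.success for r in results) / len(results)


def summarize(results: list[TaskResult]) -> dict[str, Optional[float]]:
    """Compute the three rates under the assumed definitions."""
    return {
        "overall_sr": success_rate(results),
        "query_sr": success_rate(r for r in results if r.kind == "query"),
        "action_sr": success_rate(r for r in results if r.kind == "action"),
    }


if __name__ == "__main__":
    demo = [
        TaskResult("q1", "query", True),
        TaskResult("q2", "query", False),
        TaskResult("a1", "action", True),
        TaskResult("a2", "action", True),
    ]
    print(summarize(demo))
```

Under these assumptions, Overall SR is a task-count-weighted combination of the two per-type rates, so it need not equal their simple average when the task types are unevenly represented.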
| Rank | Subject | Overall SR | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude 3.5 Sonnet v2 | 69.67% | — | Imported | 2026-05-27 |
| 2 | GPT-4o | 64.00% | GPT-4o (`openai-gpt-4o`) | Imported | 2026-05-27 |
| 3 | DeepSeek-V3 | 62.67% | DeepSeek V3 (`deepseek-deepseek-chat`) | Imported | 2026-05-27 |
| 4 | Gemini-1.5 Pro | 62.00% | — | Imported | 2026-05-27 |
| 5 | GPT-4o-mini | 56.33% | GPT-4o-mini (`openai-gpt-4o-mini`) | Imported | 2026-05-27 |
| 6 | o3-mini | 51.67% | o3-mini (`openai-o3-mini`) | Imported | 2026-05-27 |
| 7 | Qwen2.5 | 51.33% | — | Imported | 2026-05-27 |
| 8 | Llama 3.3 | 46.33% | — | Imported | 2026-05-27 |
| 9 | Gemini 2.0 Flash | 38.33% | Gemini 2.0 Flash (`google-gemini-2.0-flash`) | Imported | 2026-05-27 |
| 10 | Gemma2 | 19.33% | — | Imported | 2026-05-27 |
| 11 | Gemini 2.0 Pro | 18.00% | — | Imported | 2026-05-27 |
| 12 | Mistral v0.3 | 4.00% | — | Imported | 2026-05-27 |