AgentClinic
Clinical simulation benchmark for sequential diagnostic decision-making with multimodal patient interactions and compliance checks.
7 rows
Primary metric: agentclinic_medqa_diagnostic_accuracy
Sampled: 2026-05-27
Metadata
Metrics
AgentClinic-MedQA diagnostic accuracy, AgentClinic-MedQA 95% CI low, AgentClinic-MedQA 95% CI high, AgentClinic-MIMIC-IV diagnostic accuracy, AgentClinic-MIMIC-IV 95% CI low, AgentClinic-MIMIC-IV 95% CI high
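The "95% CI low" and "95% CI high" metrics bound each accuracy figure. This page does not say how those bounds are derived, so the following is only a minimal sketch of one common choice, the Wilson score interval for a binomial proportion. The function name `wilson_interval` and the counts in the usage line are hypothetical and are not taken from the benchmark.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Return (low, high) Wilson score bounds for an accuracy of correct/total.

    z = 1.96 corresponds to a two-sided 95% interval.
    Assumption: the benchmark may use a different method (e.g. bootstrap).
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half_width, center + half_width


# Hypothetical counts, for illustration only (not benchmark data).
low, high = wilson_interval(correct=50, total=100)
print(f"accuracy 50.0%, 95% CI [{low:.1%}, {high:.1%}]")
```

With small case counts the interval is wide, which is worth keeping in mind when comparing rows whose accuracies differ by only a few points.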
| Rank | Subject | AgentClinic-MedQA diagnostic accuracy | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude 3.5 | 62.1% | — | Imported | 2026-05-27 |
| 2 | GPT-4 | 51.6% | GPT-4 (openai-gpt-4) | Imported | 2026-05-27 |
| 3 | Mixtral-8x7B | 37.1% | — | Imported | 2026-05-27 |
| 4 | GPT-3.5 | 36.6% | — | Imported | 2026-05-27 |
| 5 | GPT-4o | 34.2% | GPT-4o (openai-gpt-4o) | Imported | 2026-05-27 |
| 6 | Llama 3 70B-Instruct | 19.0% | Llama 3 70B Instruct (meta-llama-llama-3-70b-instruct) | Imported | 2026-05-27 |
| 7 | Llama 2 70B-chat | 4.5% | — | Imported | 2026-05-27 |
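The ranking in the table follows from sorting subjects by the primary metric in descending order. As a small sketch of that step, the snippet below reuses the accuracy values from the table; the names `ROWS` and `assign_ranks` are hypothetical and do not come from the leaderboard's own tooling.

```python
# (subject, AgentClinic-MedQA diagnostic accuracy in percent), copied from the table.
ROWS = [
    ("GPT-4o", 34.2),
    ("Claude 3.5", 62.1),
    ("Llama 2 70B-chat", 4.5),
    ("GPT-4", 51.6),
    ("Llama 3 70B-Instruct", 19.0),
    ("Mixtral-8x7B", 37.1),
    ("GPT-3.5", 36.6),
]


def assign_ranks(rows):
    """Sort by accuracy, highest first, and attach 1-based ranks."""
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    return [(rank, subject, score) for rank, (subject, score) in enumerate(ordered, start=1)]


ranked = assign_ranks(ROWS)
for rank, subject, score in ranked:
    print(f"{rank}. {subject}: {score:.1f}%")
```

This sketch does not handle ties; a real leaderboard would need an explicit tie-breaking rule.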