AgentClinic

Clinical simulation benchmark for sequential diagnostic decision-making with multimodal patient interactions and compliance checks.

Rows: 7
Primary metric: agentclinic_medqa_diagnostic_accuracy
Sampled: 2026-05-27

Metadata

Metrics

AgentClinic-MedQA diagnostic accuracy, AgentClinic-MedQA 95% CI low, AgentClinic-MedQA 95% CI high, AgentClinic-MIMIC-IV diagnostic accuracy, AgentClinic-MIMIC-IV 95% CI low, AgentClinic-MIMIC-IV 95% CI high

Latest Results

Rows are transcribed from Appendix D of the public AgentClinic arXiv preprint (v5). The primary score is AgentClinic-MedQA diagnostic accuracy; AgentClinic-MIMIC-IV accuracy and the 95% confidence intervals are retained as additional metrics.

Rank  Subject               AgentClinic-MedQA diagnostic accuracy  Model Match                                             Provenance  Sampled
1     Claude 3.5            62.1%                                  -                                                       Imported    2026-05-27
2     GPT-4                 51.6%                                  GPT-4 (openai-gpt-4)                                    Imported    2026-05-27
3     Mixtral-8x7B          37.1%                                  -                                                       Imported    2026-05-27
4     GPT-3.5               36.6%                                  -                                                       Imported    2026-05-27
5     GPT-4o                34.2%                                  GPT-4o (openai-gpt-4o)                                  Imported    2026-05-27
6     Llama 3 70B-Instruct  19.0%                                  Llama 3 70B Instruct (meta-llama-llama-3-70b-instruct)  Imported    2026-05-27
7     Llama 2 70B-chat      4.5%                                   -                                                       Imported    2026-05-27
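
The ranking above follows directly from sorting the rows by the primary metric in descending order. The sketch below reproduces that ordering from the seven listed scores. The `Row` class, its field names, and the `rank` helper are hypothetical names chosen for illustration, not part of any AgentClinic tooling, and the rows are deliberately entered out of order to show that the sort recovers the published ranks.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Row:
    """One leaderboard entry (hypothetical structure, for illustration only)."""
    subject: str
    medqa_accuracy: float          # AgentClinic-MedQA diagnostic accuracy, in percent
    model_match: Optional[str]     # matched model slug, or None when no match is listed


# Scores copied from the table above, entered in arbitrary order.
ROWS: List[Row] = [
    Row("GPT-4o", 34.2, "openai-gpt-4o"),
    Row("Llama 2 70B-chat", 4.5, None),
    Row("Claude 3.5", 62.1, None),
    Row("GPT-3.5", 36.6, None),
    Row("Llama 3 70B-Instruct", 19.0, "meta-llama-llama-3-70b-instruct"),
    Row("GPT-4", 51.6, "openai-gpt-4"),
    Row("Mixtral-8x7B", 37.1, None),
]


def rank(rows: List[Row]) -> List[Tuple[int, Row]]:
    """Return (rank, row) pairs ordered by the primary metric, highest first."""
    ordered = sorted(rows, key=lambda r: r.medqa_accuracy, reverse=True)
    return list(enumerate(ordered, start=1))


if __name__ == "__main__":
    for position, row in rank(ROWS):
        print(f"{position}  {row.subject:<22}{row.medqa_accuracy:.1f}%")
```

Only the primary metric is used here; the MIMIC-IV accuracy and confidence-interval columns named under Metrics are not shown in the table above, so they are left out of the sketch rather than guessed.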