AutoMedBench
Medical AutoResearch benchmark for single-LLM coding agents across segmentation, image enhancement, VQA, report generation, and lesion detection, with workflow and task-quality scoring.
7 rows
Primary metric: average_overall_score
Sampled: 2026-05-27
Metadata
Metrics
Average Overall Score, Average Agentic Score, Average Task Score, Average Turns, Average Time (lower is better), Average Cost (lower is better), Average Tokens (lower is better), Failure Rate (lower is better), Run Count
| Rank | Subject | Average Overall Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 | 69.6946 | — | Imported | 2026-05-27 |
| 2 | GLM-5 | 63.3196 | — | Imported | 2026-05-27 |
| 3 | Gemini 3.1 Pro | 61.2258 | — | Imported | 2026-05-27 |
| 4 | Qwen3.5 | 56.5764 | — | Imported | 2026-05-27 |
| 5 | ChatGPT 5.4 | 53.4705 | — | Imported | 2026-05-27 |
| 6 | MiniMax M2.5 | 52.344 | — | Imported | 2026-05-27 |
| 7 | Kimi K2.5 | 33.8575 | — | Imported | 2026-05-27 |