AutoMedBench

Medical AutoResearch benchmark for single-LLM coding agents across segmentation, image enhancement, VQA, report generation, and lesion detection, with workflow and task-quality scoring.

Rows: 7
Primary metric: average_overall_score
Sampled: 2026-05-27

Metadata

Metrics

Average Overall Score, Average Agentic Score, Average Task Score, Average Turns, Average Time (lower is better), Average Cost (lower is better), Average Tokens (lower is better), Failure Rate (lower is better), Run Count

Latest Results

Rows parsed from AutoMedBench's embedded static leaderboard data across segmentation, image enhancement, VQA, report generation, and lesion detection. Agent scores are unweighted means over task-tier rows, matching the site's aggregate chart logic.
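The aggregation described above can be sketched as follows. This is a minimal illustration, not AutoMedBench's actual code: the row layout, agent names, tier names, and scores are all hypothetical placeholders, and the only detail taken from the description is that each agent's score is an unweighted mean over its task-tier rows.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical task-tier rows: (agent, task, tier, overall_score).
# These values are made up for illustration and are not leaderboard data.
rows = [
    ("agent-a", "segmentation", "tier-1", 80.0),
    ("agent-a", "segmentation", "tier-2", 60.0),
    ("agent-a", "vqa", "tier-1", 70.0),
    ("agent-b", "segmentation", "tier-1", 50.0),
    ("agent-b", "vqa", "tier-1", 40.0),
]


def average_overall_scores(rows):
    """Return each agent's unweighted mean score over its task-tier rows."""
    by_agent = defaultdict(list)
    for agent, _task, _tier, score in rows:
        by_agent[agent].append(score)
    return {agent: mean(scores) for agent, scores in by_agent.items()}


def rank(scores):
    """Sort agents by average score, highest first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


averages = average_overall_scores(rows)
leaderboard = rank(averages)
```

Because the mean is unweighted, every task-tier row counts equally, so under this sketch a task with more tiers contributes more rows to an agent's average than a task with fewer.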

Rank  Subject          Average Overall Score  Model Match  Provenance  Sampled
1     Claude Opus 4.6  69.6946                             Imported    2026-05-27
2     GLM-5            63.3196                             Imported    2026-05-27
3     Gemini 3.1 Pro   61.2258                             Imported    2026-05-27
4     Qwen3.5          56.5764                             Imported    2026-05-27
5     ChatGPT 5.4      53.4705                             Imported    2026-05-27
6     MiniMax M2.5     52.344                              Imported    2026-05-27
7     Kimi K2.5        33.8575                             Imported    2026-05-27