MedVidBench

Medical and surgical video understanding benchmark for video large language models. It covers 6,245 test samples across eight tasks: temporal action localization, spatiotemporal grounding, captioning, next-action prediction, CVS assessment, video summary, region captioning, and surgical skill assessment.

Rows: 1
Primary metric: average_normalized_score
Sampled: 2026-05-06

Metadata

Metrics

Average Normalized Score, CVS Accuracy, Next Action Accuracy, Skill Assessment Accuracy, Spatiotemporal Grounding mIoU, Temporal Action Grounding mIoU@0.3, Temporal Action Grounding mIoU@0.5, Dense Video Captioning F1, Dense Video Captioning LLM Judge, Video Summary LLM Judge, Region Caption LLM Judge

Latest Results

Rows are parsed from the public Hugging Face Space setup script that initializes the MedVidBench leaderboard. The live Space stores subsequent submissions in a private/persistent leaderboard, so this snapshot captures only the initial result that can be parsed from public sources.

Rank | Subject | Average Normalized Score | Model Match | Provenance | Sampled
1 | uAI-NEXUS-MedVLM-1.0a-7B-RL | 44.75 | | Imported | 2026-05-06