V-STaR

Spatio-temporal reasoning benchmark for Video-LLMs, reporting category-level modified arithmetic mean (mAM) and modified logarithmic geometric mean (mLGM) over video question chains.

Rows: 14
Primary metric: mean_mam
Sampled: 2026-05-06

Metadata

Metrics

Mean mAM, Mean mLGM
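
As a rough illustration of how an arithmetic mean and a logarithmic geometric mean can rank the same question chain differently, the sketch below aggregates three per-question scores in both ways. The formula `-(1/n) * sum(ln(1 - x + eps))`, the `eps` default, the function names, and the example scores are all assumptions made for illustration; they are not the benchmark's confirmed definitions of mAM or mLGM, which should be taken from the V-STaR paper. Scores here are fractions in [0, 1], whereas the leaderboard reports percentages.

```python
import math


def arithmetic_mean(scores):
    """Plain average of per-question scores, each in [0, 1]."""
    return sum(scores) / len(scores)


def log_geometric_mean(scores, eps=1e-6):
    """Assumed log-based aggregate: -(1/n) * sum(ln(1 - s + eps)).

    Scores close to 1 contribute much more than scores close to 0,
    and eps keeps the logarithm finite when a score equals 1.
    """
    return -sum(math.log(1.0 - s + eps) for s in scores) / len(scores)


# Hypothetical scores for one chain of three questions (not real results).
chain_scores = [0.60, 0.30, 0.20]

am = arithmetic_mean(chain_scores)
lgm = log_geometric_mean(chain_scores)
print(f"AM={am:.4f}  LGM={lgm:.4f}")
```

Under these assumptions, the arithmetic mean treats every point of improvement equally, while the log-based aggregate grows faster as individual scores approach 1.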

Latest Results

Rank  Subject          Mean mAM  Model Match             Provenance  Sampled
1     Gemini-2-Flash   27.43     -                       Imported    2026-05-06
2     GPT-4o           26.26     GPT-4o (openai-gpt-4o)  Imported    2026-05-06
3     Qwen2.5-VL       22.81     -                       Imported    2026-05-06
4     Video-Llama3     22.62     -                       Imported    2026-05-06
5     Video-CCAM-v1.2  22.05     -                       Imported    2026-05-06
6     Llava-Video      21.93     -                       Imported    2026-05-06
7     InternVL-2.5     17.57     -                       Imported    2026-05-06
8     Sa2VA            17.43     -                       Imported    2026-05-06
9     VTimeLLM         17.22     -                       Imported    2026-05-06
10    Qwen2-VL         17.06     -                       Imported    2026-05-06
11    VideoChat2       16.41     -                       Imported    2026-05-06
12    Oryx-1.5         13.75     -                       Imported    2026-05-06
13    TRACE            13.15     -                       Imported    2026-05-06
14    TimeChat         13.01     -                       Imported    2026-05-06