Video SimpleQA

Video SimpleQA evaluates factual grounding in large video language models with short-form, multi-hop, temporally grounded video questions.

Rows: 38
Primary metric: f_score
Sampled: 2026-05-06

Metadata

Metrics

F-score, Correct, Incorrect (lower is better), Not Attempted (lower is better), Correct Given Attempted, Engineering F-score, Nature F-score, Science F-score, Society and Culture F-score
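This page does not state how F-score is derived from the other metrics. The following is a minimal sketch, assuming Video SimpleQA follows the SimpleQA convention in which F-score is the harmonic mean of Correct (over all questions) and Correct Given Attempted. The function names and the example counts are hypothetical and are not taken from the leaderboard.

```python
def correct_given_attempted(correct: int, incorrect: int) -> float:
    """Share of attempted questions (correct + incorrect) answered correctly."""
    attempted = correct + incorrect
    return correct / attempted if attempted else 0.0


def f_score(correct: int, incorrect: int, not_attempted: int) -> float:
    """Harmonic mean of Correct and Correct Given Attempted.

    Assumption: this mirrors the SimpleQA definition; the page itself
    does not give a formula.
    """
    total = correct + incorrect + not_attempted
    if total == 0:
        return 0.0
    co = correct / total                      # Correct rate over all questions
    cga = correct_given_attempted(correct, incorrect)
    if co + cga == 0:
        return 0.0
    return 2 * co * cga / (co + cga)


# Hypothetical counts: 50 correct, 30 incorrect, 20 not attempted.
example = f_score(correct=50, incorrect=30, not_attempted=20)
```

Under this assumption, a model that declines to answer when unsure raises its Correct Given Attempted while lowering Correct, so the F-score rewards calibrated abstention only up to a point.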

Latest Results

Rows are parsed from the benchmark's public static HTML leaderboard and ranked by F-score in descending order. In the source, CO, IN, NA, and CGA denote Correct, Incorrect, Not Attempted, and Correct Given Attempted.
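The ranking step can be sketched as follows, assuming the rows have already been extracted from the HTML into (subject, F-score) pairs. The `Row` type and `rank_rows` helper are hypothetical names used for illustration; the three sample values are taken from the table on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    subject: str
    f_score: float


def rank_rows(rows: list[Row]) -> list[tuple[int, Row]]:
    """Sort rows by F-score, highest first, and attach 1-based ranks."""
    ordered = sorted(rows, key=lambda r: r.f_score, reverse=True)
    return list(enumerate(ordered, start=1))


# Sample rows in arbitrary order, as they might arrive from a parser.
rows = [
    Row("GPT-4o (OpenAI)", 49.30),
    Row("o3 (OpenAI)", 66.30),
    Row("Gemini 2.5 Pro (Google)", 62.60),
]
ranked = rank_rows(rows)
```

Python's `sorted` is stable, so rows with equal F-scores keep their input order; how the actual leaderboard breaks ties is not stated here.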

| Rank | Subject | F-score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | o3 (OpenAI) | 66.30 | o3 (openai-o3) | Imported | 2026-05-06 |
| 2 | Gemini 2.5 Pro (Google) | 62.60 | Gemini 2.5 Pro (google-gemini-2.5-pro) | Imported | 2026-05-06 |
| 3 | Gemini 2.5 Flash (Google) | 57 | Gemini 2.5 Flash (google-gemini-2.5-flash) | Imported | 2026-05-06 |
| 4 | GPT-4.5 (OpenAI) | 54.10 | GPT-4.5 (openai-gpt-4.5-preview) | Imported | 2026-05-06 |
| 5 | o4-mini (OpenAI) | 54 | o4 Mini (openai-o4-mini) | Imported | 2026-05-06 |
| 6 | GPT-4o (OpenAI) | 49.30 | GPT-4o (openai-gpt-4o) | Imported | 2026-05-06 |
| 7 | Qwen-VL-Max (Alibaba) | 39.90 | Qwen VL Max (qwen-qwen-vl-max) | Imported | 2026-05-06 |
| 8 | Qwen2.5-VL-72B (Alibaba) | 39.50 | Qwen2.5 VL 72B Instruct (qwen-qwen2.5-vl-72b-instruct) | Imported | 2026-05-06 |
| 9 | Claude 3.7 Sonnet (Anthropic) | 36.20 | Claude 3.7 Sonnet (anthropic-claude-3.7-sonnet) | Imported | 2026-05-06 |
| 10 | Claude Sonnet 4 (Anthropic) | 35.60 | Claude Sonnet 4 (anthropic-claude-sonnet-4) | Imported | 2026-05-06 |
| 11 | Claude 3.5 Sonnet2 (Anthropic) | 35.20 | | Imported | 2026-05-06 |
| 12 | Qwen2-VL-72B (Alibaba) | 34.20 | | Imported | 2026-05-06 |
| 13 | Claude 3.5 Sonnet (Anthropic) | 34 | Claude 3.5 Sonnet (anthropic-claude-3.5-sonnet) | Imported | 2026-05-06 |
| 14 | InternVL3-78B (Shanghai AI Lab) | 33.80 | | Imported | 2026-05-06 |
| 15 | InternVL3-38B (Shanghai AI Lab) | 31.50 | | Imported | 2026-05-06 |
| 16 | Qwen2.5-VL-32B (Alibaba) | 30.70 | | Imported | 2026-05-06 |
| 17 | Keye-VL (Kwai-Keye) | 28.30 | | Imported | 2026-05-06 |
| 18 | LLaVA-OneVision-72B (Llava Hugging Face) | 25.50 | | Imported | 2026-05-06 |
| 19 | Qwen2.5-VL-7B (Alibaba) | 25.30 | | Imported | 2026-05-06 |
| 20 | InternVL3-14B (Shanghai AI Lab) | 25.20 | | Imported | 2026-05-06 |
| 21 | Qwen-VL-Plus (Alibaba) | 23.70 | Qwen VL Plus (qwen-qwen-vl-plus) | Imported | 2026-05-06 |
| 22 | InternVL3-8B (Shanghai AI Lab) | 23.50 | | Imported | 2026-05-06 |
| 23 | Qwen2-VL-7B (Alibaba) | 23.40 | | Imported | 2026-05-06 |
| 24 | InternVL3-9B (Shanghai AI Lab) | 23.10 | | Imported | 2026-05-06 |
| 25 | Qwen2.5-VL-3B (Alibaba) | 22.60 | | Imported | 2026-05-06 |
| 26 | Kimi-VL (MoonshotAI) | 22.40 | | Imported | 2026-05-06 |
| 27 | LLaVA-1.5-13B (Llava Hugging Face) | 19.70 | | Imported | 2026-05-06 |
| 28 | LLaVA-OneVision-7B (Llava Hugging Face) | 19.30 | | Imported | 2026-05-06 |
| 29 | Qwen2-VL-2B (Alibaba) | 17.20 | | Imported | 2026-05-06 |
| 30 | DeepSeek-VL2-Tiny (DeepSeek) | 16.80 | | Imported | 2026-05-06 |
| 31 | LLaVA-1.5-7B (Llava Hugging Face) | 16.60 | | Imported | 2026-05-06 |
| 32 | InternVL3-2B (Shanghai AI Lab) | 15.80 | | Imported | 2026-05-06 |
| 33 | InternVL3-1B (Shanghai AI Lab) | 11.70 | | Imported | 2026-05-06 |
| 34 | LLaVA-NeXT-Video-34B (Llava Hugging Face) | 11.50 | | Imported | 2026-05-06 |
| 35 | LLaVA-NeXT-Video-7B (Llava Hugging Face) | 11.40 | | Imported | 2026-05-06 |
| 36 | LLaVA-OneVision-0.5B (Llava Hugging Face) | 8 | | Imported | 2026-05-06 |
| 37 | DeepSeek-VL2-Small (DeepSeek) | 7.40 | | Imported | 2026-05-06 |
| 38 | DeepSeek-VL2 (DeepSeek) | 4.20 | | Imported | 2026-05-06 |