Video SimpleQA
Video SimpleQA evaluates factual grounding in large video language models with short-form, multi-hop, temporally grounded video questions.
38 rows
Primary metric: f_score
Sampled: 2026-05-06
Metadata
Metrics
F-score, Correct, Incorrect (lower is better), Not Attempted (lower is better), Correct Given Attempted, Engineering F-score, Nature F-score, Science F-score, Society and Culture F-score
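The page lists these metrics without saying how they relate. The sketch below assumes Video SimpleQA follows the SimpleQA convention: each answer is graded correct, incorrect, or not attempted, Correct Given Attempted is the share of correct answers among attempted ones, and F-score is the harmonic mean of Correct and Correct Given Attempted. The names `GradeCounts` and `summarize` are hypothetical, and the counts in the usage line are illustrative rather than benchmark data.

```python
from dataclasses import dataclass


@dataclass
class GradeCounts:
    """Number of answers per grade (hypothetical container, not an official API)."""
    correct: int
    incorrect: int
    not_attempted: int


def summarize(counts: GradeCounts) -> dict[str, float]:
    """Return the headline metrics as percentages.

    Assumption: F-score is the harmonic mean of the overall Correct rate
    and Correct Given Attempted, as in the SimpleQA convention.
    """
    total = counts.correct + counts.incorrect + counts.not_attempted
    attempted = counts.correct + counts.incorrect

    # Rates as fractions in [0, 1]; guard against empty inputs.
    correct = counts.correct / total if total else 0.0
    incorrect = counts.incorrect / total if total else 0.0
    not_attempted = counts.not_attempted / total if total else 0.0
    correct_given_attempted = counts.correct / attempted if attempted else 0.0

    # Harmonic mean of the two correctness rates.
    denom = correct + correct_given_attempted
    f_score = 2 * correct * correct_given_attempted / denom if denom else 0.0

    return {
        "correct": 100 * correct,
        "incorrect": 100 * incorrect,
        "not_attempted": 100 * not_attempted,
        "correct_given_attempted": 100 * correct_given_attempted,
        "f_score": 100 * f_score,
    }


# Illustrative counts only: 50 correct, 30 incorrect, 20 not attempted.
example = summarize(GradeCounts(correct=50, incorrect=30, not_attempted=20))
```

Under this assumption, a model that declines to answer when unsure raises its Correct Given Attempted but lowers its overall Correct rate, so the harmonic mean rewards neither blind guessing nor excessive abstention.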
| Rank | Subject | F-score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | o3 (OpenAI) | 66.30 | o3 openai-o3 | Imported | 2026-05-06 |
| 2 | Gemini 2.5 Pro (Google) | 62.60 | Gemini 2.5 Pro google-gemini-2.5-pro | Imported | 2026-05-06 |
| 3 | Gemini 2.5 Flash (Google) | 57.00 | Gemini 2.5 Flash google-gemini-2.5-flash | Imported | 2026-05-06 |
| 4 | GPT-4.5 (OpenAI) | 54.10 | GPT-4.5 openai-gpt-4.5-preview | Imported | 2026-05-06 |
| 5 | o4-mini (OpenAI) | 54.00 | o4 Mini openai-o4-mini | Imported | 2026-05-06 |
| 6 | GPT-4o (OpenAI) | 49.30 | GPT-4o openai-gpt-4o | Imported | 2026-05-06 |
| 7 | Qwen-VL-Max (Alibaba) | 39.90 | Qwen VL Max qwen-qwen-vl-max | Imported | 2026-05-06 |
| 8 | Qwen2.5-VL-72B (Alibaba) | 39.50 | Qwen2.5 VL 72B Instruct qwen-qwen2.5-vl-72b-instruct | Imported | 2026-05-06 |
| 9 | Claude 3.7 Sonnet (Anthropic) | 36.20 | Claude 3.7 Sonnet anthropic-claude-3.7-sonnet | Imported | 2026-05-06 |
| 10 | Claude Sonnet 4 (Anthropic) | 35.60 | Claude Sonnet 4 anthropic-claude-sonnet-4 | Imported | 2026-05-06 |
| 11 | Claude 3.5 Sonnet2 (Anthropic) | 35.20 | — | Imported | 2026-05-06 |
| 12 | Qwen2-VL-72B (Alibaba) | 34.20 | — | Imported | 2026-05-06 |
| 13 | Claude 3.5 Sonnet (Anthropic) | 34.00 | Claude 3.5 Sonnet anthropic-claude-3.5-sonnet | Imported | 2026-05-06 |
| 14 | InternVL3-78B (Shanghai AI Lab) | 33.80 | — | Imported | 2026-05-06 |
| 15 | InternVL3-38B (Shanghai AI Lab) | 31.50 | — | Imported | 2026-05-06 |
| 16 | Qwen2.5-VL-32B (Alibaba) | 30.70 | — | Imported | 2026-05-06 |
| 17 | Keye-VL (Kwai-Keye) | 28.30 | — | Imported | 2026-05-06 |
| 18 | LLaVA-OneVision-72B (Llava Hugging Face) | 25.50 | — | Imported | 2026-05-06 |
| 19 | Qwen2.5-VL-7B (Alibaba) | 25.30 | — | Imported | 2026-05-06 |
| 20 | InternVL3-14B (Shanghai AI Lab) | 25.20 | — | Imported | 2026-05-06 |
| 21 | Qwen-VL-Plus (Alibaba) | 23.70 | Qwen VL Plus qwen-qwen-vl-plus | Imported | 2026-05-06 |
| 22 | InternVL3-8B (Shanghai AI Lab) | 23.50 | — | Imported | 2026-05-06 |
| 23 | Qwen2-VL-7B (Alibaba) | 23.40 | — | Imported | 2026-05-06 |
| 24 | InternVL3-9B (Shanghai AI Lab) | 23.10 | — | Imported | 2026-05-06 |
| 25 | Qwen2.5-VL-3B (Alibaba) | 22.60 | — | Imported | 2026-05-06 |
| 26 | Kimi-VL (MoonshotAI) | 22.40 | — | Imported | 2026-05-06 |
| 27 | LLaVA-1.5-13B (Llava Hugging Face) | 19.70 | — | Imported | 2026-05-06 |
| 28 | LLaVA-OneVision-7B (Llava Hugging Face) | 19.30 | — | Imported | 2026-05-06 |
| 29 | Qwen2-VL-2B (Alibaba) | 17.20 | — | Imported | 2026-05-06 |
| 30 | DeepSeek-VL2-Tiny (DeepSeek) | 16.80 | — | Imported | 2026-05-06 |
| 31 | LLaVA-1.5-7B (Llava Hugging Face) | 16.60 | — | Imported | 2026-05-06 |
| 32 | InternVL3-2B (Shanghai AI Lab) | 15.80 | — | Imported | 2026-05-06 |
| 33 | InternVL3-1B (Shanghai AI Lab) | 11.70 | — | Imported | 2026-05-06 |
| 34 | LLaVA-NeXT-Video-34B (Llava Hugging Face) | 11.50 | — | Imported | 2026-05-06 |
| 35 | LLaVA-NeXT-Video-7B (Llava Hugging Face) | 11.40 | — | Imported | 2026-05-06 |
| 36 | LLaVA-OneVision-0.5B (Llava Hugging Face) | 8.00 | — | Imported | 2026-05-06 |
| 37 | DeepSeek-VL2-Small (DeepSeek) | 7.40 | — | Imported | 2026-05-06 |
| 38 | DeepSeek-VL2 (DeepSeek) | 4.20 | — | Imported | 2026-05-06 |