PerceptionTest

PerceptionTest is a multimodal video benchmark designed to evaluate the perception and reasoning skills of pre-trained models across video, audio, and text modalities. It contains 11.6k real-world videos (average length 23 seconds) filmed by participants worldwide and densely annotated with six types of labels. It covers four skill areas (Memory, Abstraction, Physics, Semantics) and four reasoning types (descriptive, explanatory, predictive, counterfactual). The benchmark shows a significant performance gap between the human baseline (91.4%) and state-of-the-art video QA models (46.2%).

Rows: 2

Primary metric: score

Sampled: 2026-05-06

Metadata

ID: perceptiontest
Category: Multimodal
Release: 2023-05-23

Metrics

Score, Normalized Score

Rank	Subject	Score	Model Match	Provenance	Sampled
1	Qwen2.5 VL 72B Instruct	0.73	Qwen2.5 VL 72B Instruct (qwen-qwen2.5-vl-72b-instruct)	Self-reported	2026-05-06
2	Qwen2.5 VL 7B Instruct	0.70	—	Self-reported	2026-05-06
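
The table above can be read programmatically and compared against the human baseline quoted in the description. The following is a minimal Python sketch, not an official loader: it copies a subset of the table's columns into a string, and it assumes that the Score column is an accuracy fraction on the same scale as the 91.4% human baseline, which this page does not confirm. The names `load_rows` and `gap_to_human` are hypothetical helpers introduced for illustration.

```python
import csv
import io

# Rows copied from the table above (subset of columns, tab-separated).
TABLE = (
    "Rank\tSubject\tScore\tProvenance\tSampled\n"
    "1\tQwen2.5 VL 72B Instruct\t0.73\tSelf-reported\t2026-05-06\n"
    "2\tQwen2.5 VL 7B Instruct\t0.70\tSelf-reported\t2026-05-06\n"
)

# Human baseline quoted in the benchmark description, in percent.
HUMAN_BASELINE_PCT = 91.4


def load_rows(text):
    """Parse the tab-separated table into a list of dicts with typed fields."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = []
    for row in reader:
        rows.append(
            {
                "rank": int(row["Rank"]),
                "subject": row["Subject"],
                "score": float(row["Score"]),
                "provenance": row["Provenance"],
                "sampled": row["Sampled"],
            }
        )
    return rows


def gap_to_human(score_fraction, human_pct=HUMAN_BASELINE_PCT):
    """Gap to the human baseline in percentage points.

    ASSUMPTION: score_fraction is an accuracy in [0, 1] that is directly
    comparable to the human baseline percentage.
    """
    return round(human_pct - score_fraction * 100.0, 1)


if __name__ == "__main__":
    for row in load_rows(TABLE):
        print(row["rank"], row["subject"], gap_to_human(row["score"]))
```

If the comparability assumption holds, the two listed models sit roughly 18.4 and 21.4 percentage points below the human baseline. Both scores are self-reported, so the comparison is indicative only.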
