Vibe-Eval

Vibe-Eval is a hard evaluation suite for measuring the progress of multimodal language models. It consists of 269 visual understanding prompts with gold-standard responses authored by experts. The benchmark has two objectives: vibe-checking multimodal chat models on day-to-day tasks, and rigorously testing frontier models. More than 50% of the questions in the hard set are answered incorrectly by all frontier models.

Rows: 8
Primary metric: Score
Sampled: 2026-05-06

Metadata

Metrics

Score, Normalized Score

Latest Results

| Rank | Subject | Score | Model Match | Provenance | Sampled |
|------|---------|-------|-------------|------------|---------|
| 1 | Gemini 2.5 Pro Preview 06-05 | 0.67 | Gemini 2.5 Pro Preview 06-05 (google-gemini-2.5-pro-preview) | Self-reported | 2026-05-06 |
| 2 | Gemini 2.5 Pro | 0.66 | Gemini 2.5 Pro (google-gemini-2.5-pro) | Self-reported | 2026-05-06 |
| 3 | Gemini 2.5 Flash | 0.65 | Gemini 2.5 Flash (google-gemini-2.5-flash) | Self-reported | 2026-05-06 |
| 4 | Gemini 2.0 Flash | 0.56 | Gemini 2.0 Flash (google-gemini-2.0-flash) | Self-reported | 2026-05-06 |
| 5 | Gemini 1.5 Pro | 0.54 | | Self-reported | 2026-05-06 |
| 6 | Gemini 2.5 Flash-Lite | 0.51 | Gemini 2.5 Flash Lite (google-gemini-2.5-flash-lite) | Self-reported | 2026-05-06 |
| 7 | Gemini 1.5 Flash | 0.49 | | Self-reported | 2026-05-06 |
| 8 | Gemini 1.5 Flash 8B | 0.41 | | Self-reported | 2026-05-06 |
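
For readers who want to work with these results programmatically, the following is a minimal sketch of how the eight self-reported scores listed above could be ranked and compared to the top entry. The `Row` type and the `ranked` and `gap_to_top` helpers are hypothetical names introduced here for illustration; they are not part of any Vibe-Eval tooling, and the sketch does not attempt to reproduce the Normalized Score metric, whose definition is not given on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """One leaderboard entry (hypothetical structure, for illustration only)."""
    subject: str
    score: float


# Scores copied from the results listed above (all self-reported).
ROWS = [
    Row("Gemini 2.5 Pro Preview 06-05", 0.67),
    Row("Gemini 2.5 Pro", 0.66),
    Row("Gemini 2.5 Flash", 0.65),
    Row("Gemini 2.0 Flash", 0.56),
    Row("Gemini 1.5 Pro", 0.54),
    Row("Gemini 2.5 Flash-Lite", 0.51),
    Row("Gemini 1.5 Flash", 0.49),
    Row("Gemini 1.5 Flash 8B", 0.41),
]


def ranked(rows):
    """Return rows sorted by score, highest first."""
    return sorted(rows, key=lambda r: r.score, reverse=True)


def gap_to_top(rows):
    """Map each subject to its score difference from the best score."""
    top = max(r.score for r in rows)
    # Round to two decimals to match the precision of the reported scores.
    return {r.subject: round(top - r.score, 2) for r in rows}


if __name__ == "__main__":
    for rank, row in enumerate(ranked(ROWS), start=1):
        print(rank, row.subject, row.score)
```

Sorting on the primary metric reproduces the rank order shown in the results, and `gap_to_top` gives a quick view of how far each model sits from the leading entry.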