Vibe-Eval
Vibe-Eval is a hard evaluation suite for measuring the progress of multimodal language models. It consists of 269 visual understanding prompts with gold-standard responses authored by experts. The benchmark has two objectives: vibe-checking multimodal chat models on day-to-day tasks and rigorously testing frontier models. More than 50% of the questions in its hard set are answered incorrectly by all frontier models.
Rows: 8
Primary metric: Score
Sampled: 2026-05-06
Metadata
Metrics: Score, Normalized Score
| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Gemini 2.5 Pro Preview 06-05 | 0.67 | Gemini 2.5 Pro Preview 06-05 google-gemini-2.5-pro-preview | Self-reported | 2026-05-06 |
| 2 | Gemini 2.5 Pro | 0.66 | Gemini 2.5 Pro google-gemini-2.5-pro | Self-reported | 2026-05-06 |
| 3 | Gemini 2.5 Flash | 0.65 | Gemini 2.5 Flash google-gemini-2.5-flash | Self-reported | 2026-05-06 |
| 4 | Gemini 2.0 Flash | 0.56 | Gemini 2.0 Flash google-gemini-2.0-flash | Self-reported | 2026-05-06 |
| 5 | Gemini 1.5 Pro | 0.54 | — | Self-reported | 2026-05-06 |
| 6 | Gemini 2.5 Flash-Lite | 0.51 | Gemini 2.5 Flash Lite google-gemini-2.5-flash-lite | Self-reported | 2026-05-06 |
| 7 | Gemini 1.5 Flash | 0.49 | — | Self-reported | 2026-05-06 |
| 8 | Gemini 1.5 Flash 8B | 0.41 | — | Self-reported | 2026-05-06 |