VTB

Evaluating how LLMs can dynamically interact with and reason about visual information

20 rows
Primary metric: score
Sampled: 2026-05-06

Metadata

Metrics

Score, Confidence Interval Upper, Max Score

Latest Results

Rank  Subject                                       Score  Model Match                                                   Provenance  Sampled
1     gpt-5.4-2026-03-05 (reasoning effort = high)  29.17  GPT-5.4 (openai-gpt-5.4)                                      Imported    2026-05-06
1     gemini-3.1-pro-preview                        28.97  Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview)        Imported    2026-05-06
2     claude-opus-4-6-thinking                      27.52  Claude Opus 4.6 (anthropic-claude-opus-4.6)                   Imported    2026-05-06
3     gemini-3-pro-preview                          26.85  Gemini 3 (google-gemini-3)                                    Imported    2026-05-06
5     gpt-5-2025-08-07-thinking                     18.68  GPT-5 (openai-gpt-5)                                          Imported    2026-05-06
6     gpt-5-2025-08-07                              16.96  GPT-5 (openai-gpt-5)                                          Imported    2026-05-06
7     o3-2025-04-16                                 13.74  o3 (openai-o3)                                                Imported    2026-05-06
7     gemini-2.5-pro-preview-06-05                  11.75  Gemini 2.5 Pro Preview 06-05 (google-gemini-2.5-pro-preview)  Imported    2026-05-06
8     o4-mini-2025-04-16                            11.12  o4 Mini (openai-o4-mini)                                      Imported    2026-05-06
10    claude-sonnet-4-5-20250929-thinking           6.20   Claude Sonnet 4.5 (anthropic-claude-sonnet-4.5)               Imported    2026-05-06
11    claude-sonnet-4-5-20250929                    5.60   Claude Sonnet 4.5 (anthropic-claude-sonnet-4.5)               Imported    2026-05-06
11    gpt-4.1-2025-04-14                            5.52   GPT-4.1 (openai-gpt-4.1)                                      Imported    2026-05-06
11    claude-opus-4-1-20250805-thinking             5.16   Claude Opus 4.1 (anthropic-claude-opus-4.1)                   Imported    2026-05-06
13    claude-opus-4-1-20250805                      4.71   Claude Opus 4.1 (anthropic-claude-opus-4.1)                   Imported    2026-05-06
13    gemini-2.5-flash                              4.69   Gemini 2.5 Flash (google-gemini-2.5-flash)                    Imported    2026-05-06
13    claude-sonnet-4                               4.48   Claude Sonnet 4 (anthropic-claude-sonnet-4)                   Imported    2026-05-06
15    claude-sonnet-4-thinking                      4.44   -                                                             Imported    2026-05-06
18    nova-premier                                  2      -                                                             Imported    2026-05-06
19    llama4-scout                                  1.58   Llama 4 Scout (meta-llama-llama-4-scout)                      Imported    2026-05-06
19    llama4-maverick                               1.41   Llama 4 Maverick (meta-llama-4-maverick)                      Imported    2026-05-06