Vals Multimodal Index | BenchmarkList

Metadata

Score (higher is better), Std. error (lower is better), Latency (lower is better), Cost per test (lower is better)

Rank	Subject	Score	Model Match	Provenance	Sampled
1	Claude Opus 4.8	70.712%	Claude Opus 4.8 anthropic-claude-opus-4.8	Imported	2026-05-28
2	GPT 5.5	67.768%	GPT-5.5 openai-gpt-5.5	Imported	2026-05-28
3	Claude Opus 4.7	67.361%	Claude Opus 4.7 anthropic-claude-opus-4.7	Imported	2026-05-28
4	Gemini 3.5 Flash	62.291%	Gemini 3.5 Flash google-gemini-3.5-flash	Imported	2026-05-28
5	Claude Sonnet 4.6	60.783%	Claude Sonnet 4.6 anthropic-claude-sonnet-4.6	Imported	2026-05-28
6	Kimi K2.6 Thinking	56.788%	KIMI MoonshotAI: Kimi K2.6 moonshotai-kimi-k2.6	Imported	2026-05-28
7	Gemini 3.1 Pro Preview	55.749%	Gemini 3.1 Pro Preview google-gemini-3.1-pro-preview	Imported	2026-05-28
8	GPT 5.4 Mini 2026-03-17	53.298%	GPT-5.4 Mini openai-gpt-5.4-mini	Imported	2026-05-28
9	Gemini 3 Flash Preview	51.975%	Gemini 3 Flash Preview google-gemini-3-flash-preview	Imported	2026-05-28
10	Qwen 3.6 Plus	50.737%	Qwen3.6 Plus qwen-qwen3.6-plus	Imported	2026-05-28
11	GPT 5.4 Nano 2026-03-17	47.484%	GPT-5.4 Nano openai-gpt-5.4-nano	Imported	2026-05-28
12	Grok 4.3	43.435%	GROK Grok 4.3 x-ai-grok-4.3	Imported	2026-05-28
13	Claude Haiku 4.5 20251001 Thinking	42.352%	Claude Haiku 4.5 anthropic-claude-haiku-4.5	Imported	2026-05-28
14	Gemini 3.1 Flash Lite Preview	40.466%	Gemini 3.1 Flash Lite Preview google-gemini-3.1-flash-lite-preview	Imported	2026-05-28
15	Grok 4.20 0309 Reasoning	38.704%	GROK Grok 4.20 x-ai-grok-4.20	Imported	2026-05-28
16	Command A Plus 05 2026	27.186%	—	Imported	2026-05-28