SAGE

Student Assessment with Generative Evaluation

57 rows
Primary metric: score
Sampled: 2026-05-28

Metadata

Metrics

Score, Std. error (lower is better), Latency (lower is better), Cost per test (lower is better)

Latest Results

Full leaderboard rows decoded from the Vals.ai benchmark detail page. Primary score is the Overall accuracy percentage.

Rank | Subject | Score | Model Match | Provenance | Sampled
1 | Claude Opus 4.7 | 56.103% | Claude Opus 4.7 (anthropic-claude-opus-4.7) | Imported | 2026-05-28
2 | Gemma 4 31B It | 55.034% | Gemma 4 31B (google-gemma-4-31b-it) | Imported | 2026-05-28
3 | Claude Opus 4.8 | 54.788% | Claude Opus 4.8 (anthropic-claude-opus-4.8) | Imported | 2026-05-28
4 | Claude Opus 4.5 20251101 Thinking | 52.092% | Claude Opus 4.5 (anthropic-claude-opus-4.5) | Imported | 2026-05-28
5 | Gemini 3 Flash Preview | 51.849% | Gemini 3 Flash Preview (google-gemini-3-flash-preview) | Imported | 2026-05-28
6 | Claude Opus 4.6 Thinking | 51.575% | Claude Opus 4.6 (anthropic-claude-opus-4.6) | Imported | 2026-05-28
7 | GPT 5.5 | 51.532% | GPT-5.5 (openai-gpt-5.5) | Imported | 2026-05-28
8 | GPT 5.4 Mini 2026-03-17 | 50.813% | GPT-5.4 Mini (openai-gpt-5.4-mini) | Imported | 2026-05-28
9 | Kimi K2.6 Thinking | 50.224% | KIMI MoonshotAI: Kimi K2.6 (moonshotai-kimi-k2.6) | Imported | 2026-05-28
10 | Gemini 3.5 Flash | 49.885% | Gemini 3.5 Flash (google-gemini-3.5-flash) | Imported | 2026-05-28
11 | Kimi K2.5 Thinking | 49.865% | KIMI MoonshotAI: Kimi K2.5 (moonshotai-kimi-k2.5) | Imported | 2026-05-28
12 | Gemini 3.1 Flash Lite Preview | 49.54% | Gemini 3.1 Flash Lite Preview (google-gemini-3.1-flash-lite-preview) | Imported | 2026-05-28
13 | GPT 5.2 2025-12-11 | 49.27% | GPT-5.2 (openai-gpt-5.2) | Imported | 2026-05-28
14 | Gemini 3.1 Pro Preview | 48.677% | Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview) | Imported | 2026-05-28
15 | Gemini 3 Pro Preview | 47.615% | Gemini 3 (google-gemini-3) | Imported | 2026-05-28
16 | Claude Sonnet 4.6 | 46.582% | Claude Sonnet 4.6 (anthropic-claude-sonnet-4.6) | Imported | 2026-05-28
17 | Qwen 3.6 27B | 45.577% | Qwen3.6 27B (qwen-qwen3.6-27b) | Imported | 2026-05-28
18 | Claude Opus 4.5 20251101 | 45.002% | Claude Opus 4.5 (anthropic-claude-opus-4.5) | Imported | 2026-05-28
19 | Qwen 3.6 Plus | 44.86% | Qwen3.6 Plus (qwen-qwen3.6-plus) | Imported | 2026-05-28
20 | Gemini 2.5 Flash | 44.756% | Gemini 2.5 Flash (google-gemini-2.5-flash) | Imported | 2026-05-28
21 | Gemini 2.5 Flash Preview 09 2025 | 44.106% | (none) | Imported | 2026-05-28
22 | GPT 5 2025-08-07 | 43.68% | GPT-5 (openai-gpt-5) | Imported | 2026-05-28
23 | GPT 5.4 2026-03-05 | 43.312% | GPT-5.4 (openai-gpt-5.4) | Imported | 2026-05-28
24 | GPT 5.1 2025-11-13 | 43.235% | GPT-5.1 (openai-gpt-5.1) | Imported | 2026-05-28
25 | GPT 5 Mini 2025-08-07 | 42.988% | GPT-5 Mini (openai-gpt-5-mini) | Imported | 2026-05-28
26 | Qwen 3.5 Flash | 42.489% | Qwen3.5-Flash (qwen-qwen3.5-flash-02-23) | Imported | 2026-05-28
27 | Qwen 3 Vl Plus 2025-09-23 | 42.097% | (none) | Imported | 2026-05-28
28 | Gemini 2.5 Pro | 41.916% | Gemini 2.5 Pro (google-gemini-2.5-pro) | Imported | 2026-05-28
29 | O3 2025-04-16 | 41.771% | o3 (openai-o3) | Imported | 2026-05-28
30 | O4 Mini 2025-04-16 | 41.061% | o4 Mini (openai-o4-mini) | Imported | 2026-05-28
31 | Gemini 2.5 Flash Thinking | 39.602% | (none) | Imported | 2026-05-28
32 | Gemini 2.5 Flash Preview 09 2025 Thinking | 39.375% | (none) | Imported | 2026-05-28
33 | Grok 4.20 0309 Reasoning | 38.242% | GROK Grok 4.20 (x-ai-grok-4.20) | Imported | 2026-05-28
34 | GPT 5.4 Nano 2026-03-17 | 38.081% | GPT-5.4 Nano (openai-gpt-5.4-nano) | Imported | 2026-05-28
35 | Mistral Medium 3.5 | 37.613% | Mistral: Mistral Medium 3.5 (mistralai-mistral-medium-3-5) | Imported | 2026-05-28
36 | Claude Sonnet 4.5 20250929 Thinking | 36.065% | Claude Sonnet 4.5 (anthropic-claude-sonnet-4.5) | Imported | 2026-05-28
37 | Llama4 Maverick Instruct Basic | 35.702% | (none) | Imported | 2026-05-28
38 | Claude Sonnet 4 20250514 | 35% | Claude Sonnet 4 (anthropic-claude-sonnet-4) | Imported | 2026-05-28
39 | Llama 4 Scout 17B 16E Instruct | 34.834% | Llama 4 Scout (meta-llama-llama-4-scout) | Imported | 2026-05-28
40 | Claude Sonnet 4.5 20250929 | 32.88% | Claude Sonnet 4.5 (anthropic-claude-sonnet-4.5) | Imported | 2026-05-28
41 | Claude Sonnet 4 20250514 Thinking | 32.005% | (none) | Imported | 2026-05-28
42 | Claude Haiku 4.5 20251001 Thinking | 31.822% | Claude Haiku 4.5 (anthropic-claude-haiku-4.5) | Imported | 2026-05-28
43 | Claude Opus 4.1 20250805 | 31.38% | Claude Opus 4.1 (anthropic-claude-opus-4.1) | Imported | 2026-05-28
44 | Grok 4.1 Fast Reasoning | 31.331% | GROK Grok 4.1 Fast (x-ai-grok-4.1-fast) | Imported | 2026-05-28
45 | Gemini 2.5 Flash Lite Preview 09 2025 Thinking | 30.776% | (none) | Imported | 2026-05-28
46 | Qwen 3.5 Plus Thinking | 30.402% | (none) | Imported | 2026-05-28
47 | Claude Opus 4.1 20250805 Thinking | 30.388% | Claude Opus 4.1 (anthropic-claude-opus-4.1) | Imported | 2026-05-28
48 | GPT 5 Nano 2025-08-07 | 30.377% | GPT-5 Nano (openai-gpt-5-nano) | Imported | 2026-05-28
49 | Grok 4 Fast Reasoning | 29.761% | GROK Grok 4 Fast (x-ai-grok-4-fast) | Imported | 2026-05-28
50 | Gemini 2.5 Flash Lite Preview 09 2025 | 27.973% | Gemini 2.5 Flash Lite Preview 09-2025 (google-gemini-2.5-flash-lite-preview-09-2025) | Imported | 2026-05-28
51 | Gemini 2.5 Flash Lite | 27.533% | Gemini 2.5 Flash Lite (google-gemini-2.5-flash-lite) | Imported | 2026-05-28
52 | Grok 4 0709 | 25.101% | GROK Grok 4 (x-ai-grok-4) | Imported | 2026-05-28
53 | Mistral Large 2512 | 24.595% | Mistral: Mistral Large 3 2512 (mistralai-mistral-large-2512) | Imported | 2026-05-28
54 | Grok 4.3 | 19.736% | GROK Grok 4.3 (x-ai-grok-4.3) | Imported | 2026-05-28
55 | Grok 4 Fast Non Reasoning | 15.989% | GROK Grok 4 Fast (x-ai-grok-4-fast) | Imported | 2026-05-28
56 | Grok 4.1 Fast Non Reasoning | 11.915% | GROK Grok 4.1 Fast (x-ai-grok-4.1-fast) | Imported | 2026-05-28
57 | Command A Plus 05 2026 | 10.488% | (none) | Imported | 2026-05-28