RubricEval

RubricEval is a scalable framework for evaluating instruction-following models on open-ended tasks using example-specific human-authored rubrics and GPT-4o grading.

13 rows
Primary metric: score
Sampled: 2026-05-06

Metadata

Metrics

Score (primary metric), 95% CI +, 95% CI -. The ranking under Latest Results sorts by descending score, so a higher score is better.
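The page does not say how the confidence interval columns are derived. As a rough illustration only, the sketch below computes a mean score and a symmetric 95% interval from per-example scores using a normal approximation. The function name `mean_with_ci`, the sample values, and the choice of a normal approximation are all assumptions; the actual leaderboard may use a different method (for example a bootstrap, which can yield unequal + and - widths).

```python
import math
import statistics


def mean_with_ci(scores, z=1.96):
    """Return (mean, ci_plus, ci_minus) for a list of per-example scores.

    Assumption: a normal approximation with z = 1.96 for a 95% interval.
    Under this approximation the + and - half-widths are equal.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate an interval")
    mean = statistics.fmean(scores)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(scores) / math.sqrt(n)
    half_width = z * sem
    return mean, half_width, half_width


# Hypothetical per-example rubric scores, not taken from the leaderboard.
example_scores = [3.0, 2.5, 3.5, 3.0, 2.0, 4.0, 3.0, 2.5]
mean, ci_plus, ci_minus = mean_with_ci(example_scores)
print(f"score={mean:.2f} +{ci_plus:.2f} -{ci_minus:.2f}")
```

With only a handful of examples the normal approximation is crude; a t-distribution or a bootstrap would be a more careful choice for small samples.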

Latest Results

Snapshot mirrors the RubricEval Space CSV. RubricEval evaluates instruction-following models on open-ended WildBench-derived tasks with example-specific rubrics.

| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT-4 Omni | 3.18 | GPT-4 (openai-gpt-4) | Imported | 2026-05-06 |
| 2 | GPT-4 Turbo | 3.10 | GPT-4 Turbo (openai-gpt-4-turbo) | Imported | 2026-05-06 |
| 3 | Gemini 1.5 Pro | 3.06 | | Imported | 2026-05-06 |
| 4 | Gemini 1.5 Flash | 2.98 | | Imported | 2026-05-06 |
| 5 | Llama 3 70B | 2.90 | | Imported | 2026-05-06 |
| 6 | Claude 3 Opus | 2.86 | | Imported | 2026-05-06 |
| 7 | Claude 3 Sonnet | 2.79 | | Imported | 2026-05-06 |
| 8 | Claude 3 Haiku | 2.73 | Claude 3 Haiku (anthropic-claude-3-haiku) | Imported | 2026-05-06 |
| 9 | Gemini 1.0 Pro | 2.56 | | Imported | 2026-05-06 |
| 10 | Llama 3 8B | 2.56 | | Imported | 2026-05-06 |
| 11 | GPT-3.5 Turbo | 2.52 | GPT-3.5 Turbo (openai-gpt-3.5-turbo) | Imported | 2026-05-06 |
| 12 | Gemma 7B | 2.14 | | Imported | 2026-05-06 |
| 13 | Gemma 2B | 1.83 | | Imported | 2026-05-06 |
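Gemini 1.0 Pro and Llama 3 8B share a score of 2.56 but are listed at ranks 9 and 10. As a small sketch of an alternative, the code below re-ranks the scores from the table above so that tied subjects share a rank (standard competition ranking). The helper name `competition_ranks` is hypothetical, and nothing here reflects how the leaderboard itself assigns ranks.

```python
# (subject, score) pairs copied from the leaderboard snapshot above.
rows = [
    ("GPT-4 Omni", 3.18),
    ("GPT-4 Turbo", 3.10),
    ("Gemini 1.5 Pro", 3.06),
    ("Gemini 1.5 Flash", 2.98),
    ("Llama 3 70B", 2.90),
    ("Claude 3 Opus", 2.86),
    ("Claude 3 Sonnet", 2.79),
    ("Claude 3 Haiku", 2.73),
    ("Gemini 1.0 Pro", 2.56),
    ("Llama 3 8B", 2.56),
    ("GPT-3.5 Turbo", 2.52),
    ("Gemma 7B", 2.14),
    ("Gemma 2B", 1.83),
]


def competition_ranks(rows):
    """Map each subject to its rank by descending score; ties share a rank."""
    # sorted() is stable, so tied subjects keep their input order.
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    ranks = {}
    prev_score, prev_rank = None, 0
    for position, (subject, score) in enumerate(ordered, start=1):
        # Reuse the previous rank on a tie, otherwise use the 1-based position.
        rank = prev_rank if score == prev_score else position
        ranks[subject] = rank
        prev_score, prev_rank = score, rank
    return ranks


ranks = competition_ranks(rows)
for subject, score in rows:
    print(f"{ranks[subject]:>2}  {subject:<18} {score:.2f}")
```

Under this scheme both 2.56 entries receive rank 9 and the next subject receives rank 11, so the total number of rows still matches the lowest rank.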