RubricEval
RubricEval is a scalable framework for evaluating instruction-following models on open-ended tasks using example-specific human-authored rubrics and GPT-4o grading.
Rows: 13
Primary metric: score
Sampled: 2026-05-06
Metadata
Metrics: Score, 95% CI +, 95% CI - (higher is better; rank 1 has the highest score)
| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT-4 Omni | 3.18 | GPT-4 openai-gpt-4 | Imported | 2026-05-06 |
| 2 | GPT-4 Turbo | 3.10 | GPT-4 Turbo openai-gpt-4-turbo | Imported | 2026-05-06 |
| 3 | Gemini 1.5 Pro | 3.06 | — | Imported | 2026-05-06 |
| 4 | Gemini 1.5 Flash | 2.98 | — | Imported | 2026-05-06 |
| 5 | Llama 3 70B | 2.90 | — | Imported | 2026-05-06 |
| 6 | Claude 3 Opus | 2.86 | — | Imported | 2026-05-06 |
| 7 | Claude 3 Sonnet | 2.79 | — | Imported | 2026-05-06 |
| 8 | Claude 3 Haiku | 2.73 | Claude 3 Haiku anthropic-claude-3-haiku | Imported | 2026-05-06 |
| 9 | Gemini 1.0 Pro | 2.56 | — | Imported | 2026-05-06 |
| 10 | Llama 3 8B | 2.56 | — | Imported | 2026-05-06 |
| 11 | GPT-3.5 Turbo | 2.52 | GPT-3.5 Turbo openai-gpt-3.5-turbo | Imported | 2026-05-06 |
| 12 | Gemma 7B | 2.14 | — | Imported | 2026-05-06 |
| 13 | Gemma 2B | 1.83 | — | Imported | 2026-05-06 |
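
The metrics above list a score with upper and lower 95% confidence bounds. The page does not say how those bounds are computed, so the following is only a minimal sketch of one common approach: a percentile bootstrap over per-example scores. The function name `bootstrap_ci95`, its parameters, and the sample scores are hypothetical and are not taken from RubricEval.

```python
import random
import statistics


def bootstrap_ci95(scores, n_resamples=2000, seed=0):
    """Return (mean, ci_plus, ci_minus) for a list of per-example scores.

    Assumption: a percentile bootstrap is used here for illustration only;
    it is not confirmed to be the method behind the leaderboard's intervals.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same result
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    resample_means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lower = resample_means[int(0.025 * n_resamples)]
    upper = resample_means[int(0.975 * n_resamples) - 1]
    mean = statistics.fmean(scores)
    # Report the interval as distances above and below the mean.
    return mean, upper - mean, mean - lower


# Hypothetical per-example scores for a single model.
example_scores = [3, 4, 2, 3, 4, 1, 3, 2, 4, 3]
mean, ci_plus, ci_minus = bootstrap_ci95(example_scores)
print(f"score={mean:.2f} (+{ci_plus:.2f} / -{ci_minus:.2f})")
```

Reporting the two bounds separately allows for an asymmetric interval, which a percentile bootstrap can produce when the score distribution is skewed.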