HELM Instruct
HELM Instruct: Evaluates broad language-model knowledge, reasoning, commonsense, instruction following, or exam-style accuracy.
Rows: 4
Primary metric: mean_win_rate
Sampled: 2026-05-28
Metadata
Metrics
Mean win rate, Anthropic RLHF dataset - Helpfulness, Anthropic RLHF dataset - Understandability, Anthropic RLHF dataset - Completeness, Anthropic RLHF dataset - Conciseness, Anthropic RLHF dataset - Harmlessness, Best ChatGPT Prompts - Helpfulness, Best ChatGPT Prompts - Understandability, Best ChatGPT Prompts - Completeness, Best ChatGPT Prompts - Conciseness, Best ChatGPT Prompts - Harmlessness
| Rank | Subject | Mean win rate | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT-3.5 Turbo (0613) | 0.688889 | GPT-3.5 Turbo (`openai-gpt-3.5-turbo`) | Imported | 2026-05-28 |
| 2 | Anthropic Claude v1.3 | 0.611111 | — | Imported | 2026-05-28 |
| 3 | GPT-4 (0314) | 0.611111 | GPT-4 (`openai-gpt-4`) | Imported | 2026-05-28 |
| 4 | Cohere Command beta (52.4B) | 0.088889 | — | Imported | 2026-05-28 |
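For readers unfamiliar with the primary metric, here is a minimal sketch of one common way to compute a mean win rate. It assumes that a model's win rate on a metric is the fraction of other models it strictly beats, and that the mean is taken across metrics. The function name, the toy model names, and the toy scores are hypothetical and are not taken from this leaderboard; the aggregation actually used for the imported values above may differ.

```python
from statistics import mean


def mean_win_rate(scores):
    """Return {model: mean win rate} for scores shaped {model: {metric: value}}.

    Assumptions (illustrative only): higher is better on every metric,
    and a tie does not count as a win.
    """
    models = list(scores)
    metrics = list(next(iter(scores.values())))
    result = {}
    for model in models:
        others = [o for o in models if o != model]
        per_metric = []
        for metric in metrics:
            # Fraction of the other models this model strictly beats on this metric.
            wins = sum(scores[model][metric] > scores[o][metric] for o in others)
            per_metric.append(wins / len(others))
        # Average the per-metric win rates.
        result[model] = mean(per_metric)
    return result


# Hypothetical toy scores, not real evaluation data.
toy_scores = {
    "model_a": {"helpfulness": 4.5, "conciseness": 4.0},
    "model_b": {"helpfulness": 4.2, "conciseness": 4.4},
    "model_c": {"helpfulness": 3.9, "conciseness": 3.5},
}
rates = mean_win_rate(toy_scores)
```

Under these assumptions two models can end up with the same mean win rate even when their per-metric scores differ, which is one way a tie such as the one between ranks 2 and 3 can arise.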