HELM Instruct

HELM Instruct: Evaluates broad language-model knowledge, reasoning, commonsense, instruction following, or exam-style accuracy.

4 rows
Primary metric: mean_win_rate
Sampled: 2026-05-28

Metadata

Metrics

Mean win rate, Anthropic RLHF dataset - Helpfulness, Anthropic RLHF dataset - Understandability, Anthropic RLHF dataset - Completeness, Anthropic RLHF dataset - Conciseness, Anthropic RLHF dataset - Harmlessness, Best ChatGPT Prompts - Helpfulness, Best ChatGPT Prompts - Understandability, Best ChatGPT Prompts - Completeness, Best ChatGPT Prompts - Conciseness, Best ChatGPT Prompts - Harmlessness

Latest Results

Rows are imported from the instruction_following group JSON that HELM Instruct publishes on GCS. Mean win rate is shown as a fraction between 0 and 1; multiply by 100 for a percentage (0.688889 is about 68.9%).

Rank  Subject                      Mean win rate  Model Match                           Provenance  Sampled
1     GPT-3.5 Turbo (0613)         0.688889       GPT-3.5 Turbo (openai-gpt-3.5-turbo)  Imported    2026-05-28
2     Anthropic Claude v1.3        0.611111       -                                     Imported    2026-05-28
3     GPT-4 (0314)                 0.611111       GPT-4 (openai-gpt-4)                  Imported    2026-05-28
4     Cohere Command beta (52.4B)  0.088889       -                                     Imported    2026-05-28
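
The scores in the table are fractions (for example 0.688889), so a reader who wants percentages or a re-ranked view has to convert them. The following is a minimal sketch of that step, not the page's actual import code: the `Row` class, its field names, and `rank_rows` are hypothetical, and the values are copied by hand from the rows above rather than read from the source JSON, whose schema is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """One leaderboard row (hypothetical shape, not the real JSON schema)."""
    subject: str
    mean_win_rate: float  # fraction in [0, 1], as shown in the table


# Values copied by hand from the table above.
ROWS = [
    Row("GPT-3.5 Turbo (0613)", 0.688889),
    Row("Anthropic Claude v1.3", 0.611111),
    Row("GPT-4 (0314)", 0.611111),
    Row("Cohere Command beta (52.4B)", 0.088889),
]


def rank_rows(rows):
    """Return (rank, subject, percentage string) tuples, best score first.

    sorted() is stable, so rows with equal scores keep their input order.
    """
    ordered = sorted(rows, key=lambda r: r.mean_win_rate, reverse=True)
    return [
        (rank, row.subject, f"{row.mean_win_rate * 100:.1f}%")
        for rank, row in enumerate(ordered, start=1)
    ]


if __name__ == "__main__":
    for rank, subject, pct in rank_rows(ROWS):
        print(f"{rank}  {subject:<28} {pct}")
```

Because the sort is stable, the two rows tied at 0.611111 stay in the order they were supplied, which matches the ranks 2 and 3 shown in the table. Whether the page itself breaks ties this way is not stated in the source.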