HREF
HREF evaluates instruction-following models with human response-guided automatic evaluation across 11 task categories.
34 rows
Primary metric: Average
Sampled: 2026-05-06
Metadata
Metrics
Average, Brainstorm, Open QA, Closed QA, Extract, Generation, Rewrite, Summarize, Classify, Reasoning Over Numerical Data, Multi-Document Synthesis, Fact Checking or Attributed QA
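The page does not say how the Average column is derived from the 11 category scores. A minimal sketch, assuming it is an unweighted (macro) mean over the categories listed above, might look like the following. The function name `macro_average` and the example scores are hypothetical and are not taken from the leaderboard.

```python
from statistics import mean

# The 11 task categories listed on this page.
CATEGORIES = [
    "Brainstorm", "Open QA", "Closed QA", "Extract", "Generation",
    "Rewrite", "Summarize", "Classify", "Reasoning Over Numerical Data",
    "Multi-Document Synthesis", "Fact Checking or Attributed QA",
]


def macro_average(scores: dict[str, float]) -> float:
    """Unweighted mean over all categories.

    ASSUMPTION: the source does not confirm that Average is computed
    this way; a weighted mean is equally possible.
    """
    missing = [c for c in CATEGORIES if c not in scores]
    if missing:
        raise ValueError(f"missing category scores: {missing}")
    return round(mean(scores[c] for c in CATEGORIES), 2)


# Hypothetical scores for illustration only; not leaderboard data.
example_scores = {c: 40.0 for c in CATEGORIES}
example_scores["Brainstorm"] = 51.0

print(macro_average(example_scores))
```

If the categories contain different numbers of prompts, a prompt-weighted mean would give a different figure, so treat this only as one plausible reading of the Average column.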
| Rank | Subject | Average | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | meta-llama/Llama-3.1-70B-Instruct | 48.98 | Llama 3.1 70B Instruct (meta-llama-llama-3.1-70b-instruct) | Imported | 2026-05-06 |
| 1 | mistralai/Mistral-Large-Instruct-2407 | 48.39 | — | Imported | 2026-05-06 |
| 1 | Qwen/Qwen2.5-72B-Instruct | 46.21 | Qwen2.5 72B Instruct (qwen-qwen-2.5-72b-instruct) | Imported | 2026-05-06 |
| 4 | allenai/Llama-3.1-Tulu-3-70B | 43.68 | — | Imported | 2026-05-06 |
| 4 | mistralai/Mistral-Small-Instruct-2409 | 42.87 | — | Imported | 2026-05-06 |
| 4 | Qwen/Qwen1.5-110B-Chat | 40.76 | — | Imported | 2026-05-06 |
| 7 | meta-llama/Meta-Llama-3.1-8B-Instruct | 38.57 | — | Imported | 2026-05-06 |
| 8 | allenai/OLMo-2-1124-13B-Instruct | 35.60 | — | Imported | 2026-05-06 |
| 8 | 01-ai/Yi-1.5-34B-Chat | 35.10 | — | Imported | 2026-05-06 |
| 8 | Qwen/Qwen2-72B-Instruct | 33.71 | — | Imported | 2026-05-06 |
| 8 | allenai/Llama-3.1-Tulu-3-8B | 33.54 | — | Imported | 2026-05-06 |
| 12 | microsoft/Phi-3-medium-4k-instruct | 30.91 | — | Imported | 2026-05-06 |
| 12 | allenai/OLMo-2-1124-7B-Instruct | 28.49 | — | Imported | 2026-05-06 |
| 14 | meta-llama/Llama-2-70b-chat-hf | 23.90 | — | Imported | 2026-05-06 |
| 14 | allenai/tulu-2-dpo-70b | 22.67 | — | Imported | 2026-05-06 |
| 14 | mistralai/Mistral-7B-Instruct-v0.3 | 22.66 | — | Imported | 2026-05-06 |
| 17 | allenai/tulu-v2.5-ppo-13b-uf-mean-70b-uf-rm | 19.57 | — | Imported | 2026-05-06 |
| 17 | meta-llama/Llama-2-13b-chat-hf | 19.56 | — | Imported | 2026-05-06 |
| 17 | WizardLMTeam/WizardLM-13B-V1.2 | 17.90 | — | Imported | 2026-05-06 |
| 20 | meta-llama/Llama-2-7b-chat-hf | 15.38 | — | Imported | 2026-05-06 |
| 20 | allenai/tulu-2-dpo-13b | 14.81 | — | Imported | 2026-05-06 |
| 22 | lmsys/vicuna-13b-v1.5 | 12.99 | — | Imported | 2026-05-06 |
| 22 | allenai/Llama-3.1-Tulu-3-70B-DPO | 11.91 | — | Imported | 2026-05-06 |
| 24 | allenai/tulu-2-dpo-7b | 10.12 | — | Imported | 2026-05-06 |
| 24 | lmsys/vicuna-7b-v1.5 | 9.50 | — | Imported | 2026-05-06 |
| 26 | allenai/OLMo-7B-0724-Instruct-hf | 7.50 | — | Imported | 2026-05-06 |
| 26 | allenai/OLMo-7B-SFT | 6.61 | — | Imported | 2026-05-06 |
| 26 | nomic-ai/gpt4all-13b-snoozy | 6.12 | GPT-4 (openai-gpt-4) | Imported | 2026-05-06 |
| 29 | TheBloke/koala-13B-HF | 5.66 | — | Imported | 2026-05-06 |
| 29 | mosaicml/mpt-7b-chat | 5.53 | — | Imported | 2026-05-06 |
| 31 | TheBloke/koala-7B-HF | 4.09 | — | Imported | 2026-05-06 |
| 31 | databricks/dolly-v2-12b | 3.53 | — | Imported | 2026-05-06 |
| 31 | databricks/dolly-v2-7b | 3.44 | — | Imported | 2026-05-06 |
| 34 | OpenAssistant/oasst-sft-1-pythia-12b | 2.18 | — | Imported | 2026-05-06 |