HREF

HREF evaluates instruction-following models with human response-guided automatic evaluation across 11 task categories.

34 rows
average (primary metric)
2026-05-06 (sampled)

Metadata

Metrics

Average, Brainstorm, Open QA, Closed QA, Extract, Generation, Rewrite, Summarize, Classify, Reasoning Over Numerical Data, Multi-Document Synthesis, Fact Checking or Attributed QA

Latest Results

This snapshot mirrors the public HREF temperature=0.0 result JSON files. Source scores are fractions; the values here are converted to percentages to match the public Space display.
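The fraction-to-percentage conversion described above can be sketched as follows. This is a minimal illustration, not the importer's actual code: the function name `fraction_to_percent`, the two-decimal rounding, and the sample entry are all assumptions made for the example.

```python
def fraction_to_percent(score: float, ndigits: int = 2) -> float:
    """Convert a fractional score in [0, 1] to a percentage.

    The two-decimal rounding is an assumption chosen to resemble the
    precision shown in the table; the real pipeline may round differently.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"expected a fraction in [0, 1], got {score!r}")
    return round(score * 100.0, ndigits)


# Hypothetical raw entry; the model name and score are placeholders,
# not values taken from the HREF result files.
raw_scores = {"example-org/example-model": 0.125}

percent_scores = {
    name: fraction_to_percent(value) for name, value in raw_scores.items()
}

for name, pct in percent_scores.items():
    print(f"{name}: {pct:.2f}")
```

Rejecting values outside [0, 1] guards against converting a score that is already a percentage a second time.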

| Rank | Subject | Average | Model Match | Provenance | Sampled |
| --- | --- | --- | --- | --- | --- |
| 1 | meta-llama/Llama-3.1-70B-Instruct | 48.98 | Llama 3.1 70B Instruct (meta-llama-llama-3.1-70b-instruct) | Imported | 2026-05-06 |
| 1 | mistralai/Mistral-Large-Instruct-2407 | 48.39 | | Imported | 2026-05-06 |
| 1 | Qwen/Qwen2.5-72B-Instruct | 46.21 | Qwen2.5 72B Instruct (qwen-qwen-2.5-72b-instruct) | Imported | 2026-05-06 |
| 4 | allenai/Llama-3.1-Tulu-3-70B | 43.68 | | Imported | 2026-05-06 |
| 4 | mistralai/Mistral-Small-Instruct-2409 | 42.87 | | Imported | 2026-05-06 |
| 4 | Qwen/Qwen1.5-110B-Chat | 40.76 | | Imported | 2026-05-06 |
| 7 | meta-llama/Meta-Llama-3.1-8B-Instruct | 38.57 | | Imported | 2026-05-06 |
| 8 | allenai/OLMo-2-1124-13B-Instruct | 35.60 | | Imported | 2026-05-06 |
| 8 | 01-ai/Yi-1.5-34B-Chat | 35.10 | | Imported | 2026-05-06 |
| 8 | Qwen/Qwen2-72B-Instruct | 33.71 | | Imported | 2026-05-06 |
| 8 | allenai/Llama-3.1-Tulu-3-8B | 33.54 | | Imported | 2026-05-06 |
| 12 | microsoft/Phi-3-medium-4k-instruct | 30.91 | | Imported | 2026-05-06 |
| 12 | allenai/OLMo-2-1124-7B-Instruct | 28.49 | | Imported | 2026-05-06 |
| 14 | meta-llama/Llama-2-70b-chat-hf | 23.90 | | Imported | 2026-05-06 |
| 14 | allenai/tulu-2-dpo-70b | 22.67 | | Imported | 2026-05-06 |
| 14 | mistralai/Mistral-7B-Instruct-v0.3 | 22.66 | | Imported | 2026-05-06 |
| 17 | allenai/tulu-v2.5-ppo-13b-uf-mean-70b-uf-rm | 19.57 | | Imported | 2026-05-06 |
| 17 | meta-llama/Llama-2-13b-chat-hf | 19.56 | | Imported | 2026-05-06 |
| 17 | WizardLMTeam/WizardLM-13B-V1.2 | 17.90 | | Imported | 2026-05-06 |
| 20 | meta-llama/Llama-2-7b-chat-hf | 15.38 | | Imported | 2026-05-06 |
| 20 | allenai/tulu-2-dpo-13b | 14.81 | | Imported | 2026-05-06 |
| 22 | lmsys/vicuna-13b-v1.5 | 12.99 | | Imported | 2026-05-06 |
| 22 | allenai/Llama-3.1-Tulu-3-70B-DPO | 11.91 | | Imported | 2026-05-06 |
| 24 | allenai/tulu-2-dpo-7b | 10.12 | | Imported | 2026-05-06 |
| 24 | lmsys/vicuna-7b-v1.5 | 9.50 | | Imported | 2026-05-06 |
| 26 | allenai/OLMo-7B-0724-Instruct-hf | 7.50 | | Imported | 2026-05-06 |
| 26 | allenai/OLMo-7B-SFT | 6.61 | | Imported | 2026-05-06 |
| 26 | nomic-ai/gpt4all-13b-snoozy | 6.12 | GPT-4 (openai-gpt-4) | Imported | 2026-05-06 |
| 29 | TheBloke/koala-13B-HF | 5.66 | | Imported | 2026-05-06 |
| 29 | mosaicml/mpt-7b-chat | 5.53 | | Imported | 2026-05-06 |
| 31 | TheBloke/koala-7B-HF | 4.09 | | Imported | 2026-05-06 |
| 31 | databricks/dolly-v2-12b | 3.53 | | Imported | 2026-05-06 |
| 31 | databricks/dolly-v2-7b | 3.44 | | Imported | 2026-05-06 |
| 34 | OpenAssistant/oasst-sft-1-pythia-12b | 2.18 | | Imported | 2026-05-06 |