AlpacaEval
Automatic instruction-following evaluator comparing model responses against a reference using GPT-4 judgments and length-controlled win rates.
102 rows
Primary metric: length_controlled_winrate
Sampled: 2026-05-27
Metadata
Metrics: Length-Controlled Win Rate, Win Rate, Standard Error (lower is better), Discrete Win Rate, Average Length (lower is better)
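To make the simpler metrics above concrete, here is a minimal sketch of how Win Rate, Discrete Win Rate, and Standard Error could be derived from per-example judge preferences. It assumes each preference is a number in [0, 1], where 1.0 means the judge preferred the model's response over the reference and 0.5 is a tie. The function name `summarize_preferences` and the input format are hypothetical and not taken from this page. The length-controlled variant needs a fitted regression model and is not reproduced here.

```python
from math import sqrt
from statistics import mean, stdev


def summarize_preferences(preferences):
    """Summarize pairwise judge preferences (illustrative sketch only).

    preferences: list of floats in [0, 1]; 1.0 = model preferred over the
    reference, 0.0 = reference preferred, 0.5 = tie. This input format is
    an assumption made for the example.
    """
    n = len(preferences)
    if n == 0:
        raise ValueError("need at least one preference")

    # Win rate: average preference, expressed as a percentage.
    win_rate = 100 * mean(preferences)

    # Standard error of the mean (sample standard deviation / sqrt(n)).
    standard_error = 100 * stdev(preferences) / sqrt(n) if n > 1 else float("nan")

    # Discrete win rate: round each preference to a win, tie, or loss first.
    discrete = [1.0 if p > 0.5 else 0.5 if p == 0.5 else 0.0 for p in preferences]
    discrete_win_rate = 100 * mean(discrete)

    return {
        "win_rate": win_rate,
        "standard_error": standard_error,
        "discrete_win_rate": discrete_win_rate,
        "n_total": n,
    }


if __name__ == "__main__":
    print(summarize_preferences([1.0, 0.0, 0.5, 1.0]))
```

Under these assumptions, a model judged against its own outputs would tie on every example and score exactly 50, which would be consistent with the `text_davinci_003` row below if that model served as the reference.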
| Rank | Subject | Length-Controlled Win Rate | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | xwinlm-70b-v0.1 | 95.56803995 | — | Imported | 2026-05-27 |
| 2 | Mistral-7B-ReMax-v0.1 | 94.39601494396015 | — | Imported | 2026-05-27 |
| 3 | xwinlm-70b-v0.3 | 94.01522563893708 | — | Imported | 2026-05-27 |
| 4 | xwinlm-13b-v0.1 | 91.76029963 | — | Imported | 2026-05-27 |
| 5 | mistral-medium | 91.54314285144824 | — | Imported | 2026-05-27 |
| 6 | ultralm-13b-best-of-16 | 91.54228856 | — | Imported | 2026-05-27 |
| 7 | gpt4_1106_preview | 89.85849210429464 | GPT-4 (openai-gpt-4) | Imported | 2026-05-27 |
| 8 | openchat-v3.1-13b | 89.49004975 | — | Imported | 2026-05-27 |
| 9 | wizardlm-13b-v1.2 | 89.16562889 | — | Imported | 2026-05-27 |
| 10 | vicuna-33b-v1.3 | 88.99253731 | — | Imported | 2026-05-27 |
| 11 | humpback-llama2-70b | 87.93532338 | — | Imported | 2026-05-27 |
| 12 | xwinlm-7b-v0.1 | 87.82771536 | — | Imported | 2026-05-27 |
| 13 | openbuddy-llama2-70b-v10.1 | 87.67123288 | — | Imported | 2026-05-27 |
| 14 | openchat-v2-w-13b | 87.12686567 | — | Imported | 2026-05-27 |
| 15 | openbuddy-llama-65b-v8 | 86.53366584 | — | Imported | 2026-05-27 |
| 16 | gpt4 | 86.51018625518144 | GPT-4 (openai-gpt-4) | Imported | 2026-05-27 |
| 17 | wizardlm-13b-v1.1 | 86.31840796 | — | Imported | 2026-05-27 |
| 18 | pairrm-tulu-2-70b | 85.58824844769076 | — | Imported | 2026-05-27 |
| 19 | gpt4_0314 | 85.334647371383 | GPT-4 (openai-gpt-4) | Imported | 2026-05-27 |
| 20 | openchat-v2-13b | 84.9689441 | — | Imported | 2026-05-27 |
| 21 | LMCocktail-10.7B-v1 | 84.7840193355363 | — | Imported | 2026-05-27 |
| 22 | pairrm-zephyr-7b-beta | 84.7091351498575 | — | Imported | 2026-05-27 |
| 23 | tulu-2-dpo-70b | 84.25730016896037 | — | Imported | 2026-05-27 |
| 24 | humpback-llama-65b | 83.70646766 | — | Imported | 2026-05-27 |
| 25 | Mistral-7B+RAHF-DUAL+LoRA | 83.35673751418108 | — | Imported | 2026-05-27 |
| 26 | Mistral-7B-Instruct-v0.2 | 82.98089782565651 | — | Imported | 2026-05-27 |
| 27 | Mixtral-8x7B-Instruct-v0.1 | 82.59666180688257 | Mistral: Mixtral 8x7B Instruct (mistralai-mixtral-8x7b-instruct) | Imported | 2026-05-27 |
| 28 | vicuna-13b-v1.3 | 82.11180124 | — | Imported | 2026-05-27 |
| 29 | gpt-3.5-turbo-16k-0613 | 81.73910844041163 | GPT-3.5 Turbo (openai-gpt-3.5-turbo) | Imported | 2026-05-27 |
| 30 | openbuddy-llama-30b-v7.1 | 81.54613466 | — | Imported | 2026-05-27 |
| 31 | gpt4_0613 | 81.38159399734118 | GPT-4 (openai-gpt-4) | Imported | 2026-05-27 |
| 32 | tulu-2-dpo-13b | 81.235850076993 | — | Imported | 2026-05-27 |
| 33 | openchat-13b | 80.86956522 | — | Imported | 2026-05-27 |
| 34 | openbuddy-falcon-40b-v9 | 80.69738481 | — | Imported | 2026-05-27 |
| 35 | ultralm-13b | 80.63511831 | — | Imported | 2026-05-27 |
| 36 | openchat8192-13b | 79.539801 | — | Imported | 2026-05-27 |
| 37 | gpt-3.5-turbo-0301 | 79.17893267677465 | GPT-3.5 Turbo (openai-gpt-3.5-turbo) | Imported | 2026-05-27 |
| 38 | opencoderplus-15b | 78.69565217 | — | Imported | 2026-05-27 |
| 39 | tulu-2-dpo-7b | 77.85355333126851 | — | Imported | 2026-05-27 |
| 40 | openbuddy-llama2-13b-v11.1 | 77.48756219 | — | Imported | 2026-05-27 |
| 41 | vicuna-7b-v1.3 | 76.84144819 | — | Imported | 2026-05-27 |
| 42 | claude | 76.83227965166517 | — | Imported | 2026-05-27 |
| 43 | Yi-34B-Chat | 76.35646640775717 | — | Imported | 2026-05-27 |
| 44 | ultralm-13b-v2.0-best-of-16 | 76.29672881234201 | — | Imported | 2026-05-27 |
| 45 | zephyr-7b-beta | 76.29202319983864 | — | Imported | 2026-05-27 |
| 46 | gpt-3.5-turbo-1106 | 75.55853548412969 | GPT-3.5 Turbo (openai-gpt-3.5-turbo) | Imported | 2026-05-27 |
| 47 | claude-2 | 74.33550560445303 | — | Imported | 2026-05-27 |
| 48 | jina-chat | 74.12718204 | — | Imported | 2026-05-27 |
| 49 | llama-2-70b-chat-hf | 74.11120112901445 | — | Imported | 2026-05-27 |
| 50 | airoboros-65b | 73.91304348 | — | Imported | 2026-05-27 |
| 51 | zephyr-7b-alpha | 73.46973908236046 | — | Imported | 2026-05-27 |
| 52 | airoboros-33b | 73.29192547 | — | Imported | 2026-05-27 |
| 53 | evo-v2-7b | 72.09602817675409 | — | Imported | 2026-05-27 |
| 54 | cut-13b | 71.40952810665395 | — | Imported | 2026-05-27 |
| 55 | deita-7b-v1.0 | 71.13305243806445 | — | Imported | 2026-05-27 |
| 56 | ghost-7b-alpha | 70.44025157232704 | — | Imported | 2026-05-27 |
| 57 | openbuddy-falcon-7b-v6 | 70.3611457 | — | Imported | 2026-05-27 |
| 58 | causallm-14b | 69.99239868161098 | — | Imported | 2026-05-27 |
| 59 | pairrm-tulu-2-13b | 68.33213332478894 | — | Imported | 2026-05-27 |
| 60 | baize-v2-13b | 66.95652174 | — | Imported | 2026-05-27 |
| 61 | gpt35_turbo_instruct | 66.88517803643602 | GPT-3.5 Turbo (openai-gpt-3.5-turbo) | Imported | 2026-05-27 |
| 62 | minotaur-13b | 66.02484472 | — | Imported | 2026-05-27 |
| 63 | guanaco-33b | 65.96273292 | — | Imported | 2026-05-27 |
| 64 | claude-2.1 | 65.9557674840558 | — | Imported | 2026-05-27 |
| 65 | nous-hermes-13b | 65.46583851 | — | Imported | 2026-05-27 |
| 66 | vicuna-7b | 64.40993789 | — | Imported | 2026-05-27 |
| 67 | baize-v2-7b | 63.85093168 | — | Imported | 2026-05-27 |
| 68 | ultralm-13b-v2.0 | 63.77774668548318 | — | Imported | 2026-05-27 |
| 69 | wizardlm-13b | 62.55024525088112 | — | Imported | 2026-05-27 |
| 70 | cohere | 61.87530037843918 | — | Imported | 2026-05-27 |
| 71 | gemini-pro | 57.96703555960053 | — | Imported | 2026-05-27 |
| 72 | oasst-rlhf-llama-33b | 55.80913636693129 | — | Imported | 2026-05-27 |
| 73 | oasst-sft-llama-33b | 54.9689441 | — | Imported | 2026-05-27 |
| 74 | guanaco-65b | 54.69096685665386 | — | Imported | 2026-05-27 |
| 75 | phi-2-dpo | 54.28867357876411 | — | Imported | 2026-05-27 |
| 76 | platolm-7b | 53.09897561500652 | — | Imported | 2026-05-27 |
| 77 | guanaco-13b | 52.60869565 | — | Imported | 2026-05-27 |
| 78 | minichat-1.5-3b | 51.47924234116803 | — | Imported | 2026-05-27 |
| 79 | recycled-wizardlm-7b-v2.0 | 51.09808140925867 | — | Imported | 2026-05-27 |
| 80 | vicuna-13b | 50.00294675412896 | — | Imported | 2026-05-27 |
| 81 | text_davinci_003 | 50 | — | Imported | 2026-05-27 |
| 82 | evo-7b | 49.96597750089794 | — | Imported | 2026-05-27 |
| 83 | llama-2-13b-chat-hf | 49.81099211276289 | — | Imported | 2026-05-27 |
| 84 | claude2-alpaca-13b | 49.72428405745508 | — | Imported | 2026-05-27 |
| 85 | chatglm2-6b | 47.12858926 | — | Imported | 2026-05-27 |
| 86 | guanaco-7b | 46.58385093 | — | Imported | 2026-05-27 |
| 87 | recycled-wizardlm-7b-v1.0 | 46.27776656706335 | — | Imported | 2026-05-27 |
| 88 | llama-2-chat-7b-evol70k-neft | 45.84186320829894 | — | Imported | 2026-05-27 |
| 89 | phi-2-sft | 44.73886185749778 | — | Imported | 2026-05-27 |
| 90 | alpaca-farm-ppo-sim-gpt4-20k | 44.09937888 | GPT-4 (openai-gpt-4) | Imported | 2026-05-27 |
| 91 | pythia-12b-mix-sft | 41.86335404 | — | Imported | 2026-05-27 |
| 92 | falcon-40b-instruct | 39.14246411706998 | — | Imported | 2026-05-27 |
| 93 | falcon-7b-instruct | 39.14246411706998 | — | Imported | 2026-05-27 |
| 94 | minichat-3b | 31.963518903280573 | — | Imported | 2026-05-27 |
| 95 | alpaca-7b-neft | 31.61170102536985 | — | Imported | 2026-05-27 |
| 96 | phi-2 | 29.81920417817079 | — | Imported | 2026-05-27 |
| 97 | alpaca-farm-ppo-human | 29.78213586412439 | — | Imported | 2026-05-27 |
| 98 | llama-2-7b-chat-hf | 29.29429740470164 | — | Imported | 2026-05-27 |
| 99 | alpaca-7b | 26.29495433067113 | — | Imported | 2026-05-27 |
| 100 | oasst-sft-pythia-12b | 25.96273292 | — | Imported | 2026-05-27 |
| 101 | baichuan-13b-chat | 21.80124224 | — | Imported | 2026-05-27 |
| 102 | text_davinci_001 | 20.57118821914347 | — | Imported | 2026-05-27 |