EvalPlus

EvalPlus leaderboard aggregating HumanEval+ and MBPP+ code-generation pass@1 scores.

Rows: 25
Primary metric: evalplus_average
Sampled: 2026-05-05

Metadata

Metrics

EvalPlus Avg., HumanEval+ pass@1, MBPP+ pass@1

Latest Results

Aggregate score is the mean of HumanEval+ pass@1 and MBPP+ pass@1 from the EvalPlus public JSON feed.
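The aggregation described above can be sketched in a few lines. This is a minimal illustration, not the leaderboard's actual import code: the `Entry` type, its field names, and the example subjects and scores are hypothetical, and rounding to two decimals is an assumption based on how the averages are displayed.

```python
from typing import List, NamedTuple, Tuple


class Entry(NamedTuple):
    """One leaderboard subject (hypothetical field names)."""
    subject: str
    humaneval_plus: float  # HumanEval+ pass@1, in percent
    mbpp_plus: float       # MBPP+ pass@1, in percent


def evalplus_average(entry: Entry) -> float:
    """Unweighted mean of the two pass@1 scores.

    Rounding to two decimals is an assumption, not a documented rule.
    """
    return round((entry.humaneval_plus + entry.mbpp_plus) / 2, 2)


def rank(entries: List[Entry]) -> List[Tuple[int, str, float]]:
    """Sort by average, highest first, and attach 1-based ranks."""
    ordered = sorted(entries, key=evalplus_average, reverse=True)
    return [
        (position, e.subject, evalplus_average(e))
        for position, e in enumerate(ordered, start=1)
    ]


if __name__ == "__main__":
    # Illustrative values only; these are not scores from the feed.
    demo = [
        Entry("model-a", 70.0, 60.0),
        Entry("model-b", 80.0, 70.5),
    ]
    for position, subject, average in rank(demo):
        print(position, subject, f"{average:.2f}")
```

Because the mean is unweighted, the two benchmarks count equally regardless of how many problems each contains.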

| Rank | Subject | EvalPlus Avg. | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | O1 Preview (Sept 2024) | 84.60 | o1-preview (openai-o1-preview) | Imported | 2026-05-05 |
| 2 | O1 Mini (Sept 2024) | 83.90 | | Imported | 2026-05-05 |
| 3 | Qwen2.5-Coder-32B-Instruct | 82.10 | Qwen2.5 Coder 32B Instruct (qwen-qwen-2.5-coder-32b-instruct) | Imported | 2026-05-05 |
| 4 | DeepSeek-V3 (Nov 2024) | 79.80 | DeepSeek V3 (deepseek-deepseek-chat) | Imported | 2026-05-05 |
| 5 | GPT 4o (Aug 2024) | 79.70 | GPT-4o (openai-gpt-4o) | Imported | 2026-05-05 |
| 6 | DeepSeek-V2.5 (Nov 2024) | 78.80 | | Imported | 2026-05-05 |
| 7 | DeepSeek-Coder-V2-Instruct | 78.70 | | Imported | 2026-05-05 |
| 8 | Claude Sonnet 3.5 (June 2024) | 78.00 | | Imported | 2026-05-05 |
| 9 | GPT 4o Mini (July 2024) | 77.85 | GPT-4o-mini (openai-gpt-4o-mini) | Imported | 2026-05-05 |
| 10 | GPT-4-Turbo (Nov 2023) | 77.50 | GPT-4 Turbo (openai-gpt-4-turbo) | Imported | 2026-05-05 |
| 11 | Gemini 1.5 Pro 002 | 76.95 | | Imported | 2026-05-05 |
| 12 | claude-3-opus (Mar 2024) | 75.35 | | Imported | 2026-05-05 |
| 13 | OpenCoder-8B-Instruct | 74.40 | | Imported | 2026-05-05 |
| 14 | CodeQwen1.5-7B-Chat | 73.85 | | Imported | 2026-05-05 |
| 15 | Grok Beta | 73.05 | | Imported | 2026-05-05 |
| 16 | DeepSeek-Coder-33B-instruct | 72.55 | | Imported | 2026-05-05 |
| 17 | Gemini 1.5 Flash 002 | 71.55 | | Imported | 2026-05-05 |
| 18 | OpenCodeInterpreter-DS-33B | 71.15 | | Imported | 2026-05-05 |
| 19 | Artigenz-Coder-DS-6.7B | 71.10 | | Imported | 2026-05-05 |
| 20 | Llama3-70B-instruct | 70.50 | Llama 3 70B Instruct (meta-llama-llama-3-70b-instruct) | Imported | 2026-05-05 |
| 21 | GPT-3.5-Turbo (Nov 2023) | 70.20 | GPT-3.5 Turbo (openai-gpt-3.5-turbo) | Imported | 2026-05-05 |
| 22 | Magicoder-S-DS-6.7B | 70.15 | | Imported | 2026-05-05 |
| 23 | OpenCodeInterpreter-DS-6.7B | 69.20 | | Imported | 2026-05-05 |
| 24 | claude-3-haiku (Mar 2024) | 68.85 | Claude 3 Haiku (anthropic-claude-3-haiku) | Imported | 2026-05-05 |
| 25 | DeepSeek-Coder-6.7B-instruct | 68.45 | | Imported | 2026-05-05 |