CRUXEval

Code reasoning benchmark for input and output prediction, reporting pass@1 and pass@5 across code language models.

Rows: 25
Primary metric: average_pass_at_1
Sampled: 2026-05-05

Metadata

Metrics

Avg. pass@1, Input pass@1, Output pass@1, Input pass@5, Output pass@5
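
The pass@1 and pass@5 metrics listed above are often computed with the unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k). This page does not state which estimator or how many samples per problem produced the imported scores, so the following is only a sketch under that assumption. The function name `pass_at_k` and the parameters `n` (samples drawn per problem), `c` (correct samples) and `k` are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    Assumes k samples are drawn without replacement from n generated
    samples, of which c are correct (the usual unbiased estimator).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect samples: every draw of k contains a correct one.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts for one problem, not taken from the leaderboard.
    print(pass_at_k(n=10, c=3, k=1))
    print(pass_at_k(n=10, c=3, k=5))
```

A benchmark-level score would then be the mean of this per-problem estimate over all problems, typically reported as a percentage.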

Latest Results

Average pass@1 is the mean of input prediction pass@1 and output prediction pass@1.
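
As a minimal sketch of that definition, the primary metric can be derived from the two per-task scores as shown below. The function name and the example values are hypothetical and are not taken from the leaderboard; scores are assumed to be percentages on a 0 to 100 scale.

```python
def average_pass_at_1(input_pass_at_1: float, output_pass_at_1: float) -> float:
    """Return the mean of input-prediction and output-prediction pass@1."""
    for score in (input_pass_at_1, output_pass_at_1):
        # Assumption: both scores are percentages in [0, 100].
        if not 0.0 <= score <= 100.0:
            raise ValueError("scores are expected on a 0-100 scale")
    return (input_pass_at_1 + output_pass_at_1) / 2


if __name__ == "__main__":
    # Hypothetical scores chosen only to illustrate the arithmetic.
    print(average_pass_at_1(60.0, 70.0))  # 65.0
```

Because the two tasks are weighted equally, a model that is strong on output prediction but weak on input prediction can rank the same as a more balanced one.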

| Rank | Subject | Avg. pass@1 | Model Match | Provenance | Sampled |
|------|---------|-------------|-------------|------------|---------|
| 1 | gpt-4-turbo-2024-04-09+cot (n=3) | 78.85 | GPT-4 Turbo / openai-gpt-4-turbo | Imported | 2026-05-05 |
| 2 | claude-3-opus+cot (n=1) | 77.70 | | Imported | 2026-05-05 |
| 3 | gpt-4-0613+cot | 76.30 | GPT-4 / openai-gpt-4 | Imported | 2026-05-05 |
| 4 | gpt-4o+cot (n=3) | 75.80 | GPT-4o / openai-gpt-4o | Imported | 2026-05-05 |
| 5 | gpt-4-0613 | 69.25 | GPT-4 / openai-gpt-4 | Imported | 2026-05-05 |
| 6 | gpt-4-turbo-2024-04-09 (n=3) | 68.10 | GPT-4 Turbo / openai-gpt-4-turbo | Imported | 2026-05-05 |
| 7 | gpt-4o (n=3) | 67.55 | GPT-4o / openai-gpt-4o | Imported | 2026-05-05 |
| 8 | claude-3-opus (n=1) | 65.00 | | Imported | 2026-05-05 |
| 9 | semcoder-s-6.7b+cot (under verification) | 63.60 | | Imported | 2026-05-05 |
| 10 | semcoder-6.7b+cot (under verification) | 63.45 | | Imported | 2026-05-05 |
| 11 | gpt-3.5-turbo-0613+cot | 54.65 | GPT-3.5 Turbo / openai-gpt-3.5-turbo | Imported | 2026-05-05 |
| 12 | gpt-3.5-turbo-0613 | 49.20 | GPT-3.5 Turbo (older v0613) / openai-gpt-3.5-turbo-0613 | Imported | 2026-05-05 |
| 13 | deepseek-instruct-33b | 48.20 | | Imported | 2026-05-05 |
| 14 | starcoder2-15b | 47.60 | | Imported | 2026-05-05 |
| 15 | deepseek-base-33b | 47.55 | | Imported | 2026-05-05 |
| 16 | codetulu-2-34b | 47.55 | | Imported | 2026-05-05 |
| 17 | codellama-34b+cot | 46.85 | | Imported | 2026-05-05 |
| 18 | codellama-34b | 44.80 | | Imported | 2026-05-05 |
| 19 | phind | 43.45 | | Imported | 2026-05-05 |
| 20 | magicoder-ds-6.7b | 43.05 | | Imported | 2026-05-05 |
| 21 | wizard-34b | 43.05 | | Imported | 2026-05-05 |
| 22 | deepseek-base-6.7b | 42.70 | | Imported | 2026-05-05 |
| 23 | codellama-python-34b | 42.65 | | Imported | 2026-05-05 |
| 24 | codellama-13b+cot | 41.70 | | Imported | 2026-05-05 |
| 25 | codellama-13b | 41.10 | | Imported | 2026-05-05 |