CRUXEval | BenchmarkList

Metadata

Avg. pass@1, Input pass@1, Output pass@1, Input pass@5, Output pass@5

Rank	Subject	Avg. pass@1	Model Match	Provenance	Sampled
1	gpt-4-turbo-2024-04-09+cot (n=3)	78.85	GPT-4 Turbo openai-gpt-4-turbo	Imported	2026-05-05
2	claude-3-opus+cot (n=1)	77.70	—	Imported	2026-05-05
3	gpt-4-0613+cot	76.30	GPT-4 openai-gpt-4	Imported	2026-05-05
4	gpt-4o+cot (n=3)	75.80	GPT-4o openai-gpt-4o	Imported	2026-05-05
5	gpt-4-0613	69.25	GPT-4 openai-gpt-4	Imported	2026-05-05
6	gpt-4-turbo-2024-04-09 (n=3)	68.10	GPT-4 Turbo openai-gpt-4-turbo	Imported	2026-05-05
7	gpt-4o (n=3)	67.55	GPT-4o openai-gpt-4o	Imported	2026-05-05
8	claude-3-opus (n=1)	65.00	—	Imported	2026-05-05
9	semcoder-s-6.7b+cot (under verification)	63.60	—	Imported	2026-05-05
10	semcoder-6.7b+cot (under verification)	63.45	—	Imported	2026-05-05
11	gpt-3.5-turbo-0613+cot	54.65	GPT-3.5 Turbo openai-gpt-3.5-turbo	Imported	2026-05-05
12	gpt-3.5-turbo-0613	49.20	GPT-3.5 Turbo (older v0613) openai-gpt-3.5-turbo-0613	Imported	2026-05-05
13	deepseek-instruct-33b	48.20	—	Imported	2026-05-05
14	starcoder2-15b	47.60	—	Imported	2026-05-05
15	deepseek-base-33b	47.55	—	Imported	2026-05-05
16	codetulu-2-34b	47.55	—	Imported	2026-05-05
17	codellama-34b+cot	46.85	—	Imported	2026-05-05
18	codellama-34b	44.80	—	Imported	2026-05-05
19	phind	43.45	—	Imported	2026-05-05
20	magicoder-ds-6.7b	43.05	—	Imported	2026-05-05
21	wizard-34b	43.05	—	Imported	2026-05-05
22	deepseek-base-6.7b	42.70	—	Imported	2026-05-05
23	codellama-python-34b	42.65	—	Imported	2026-05-05
24	codellama-13b+cot	41.70	—	Imported	2026-05-05
25	codellama-13b	41.10	—	Imported	2026-05-05
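
Several subjects in the table appear twice, once plain and once with a "+cot" suffix. As a rough sketch of how those pairs could be compared, the Python snippet below computes the difference in Avg. pass@1 between the two variants. The `ROWS` list is a hand-copied subset of the table, and `cot_gains` is a hypothetical helper name, not part of any CRUXEval tooling. It assumes that parenthetical notes such as "(n=3)" or "(under verification)" are annotations and not part of the model name.

```python
import re

# A few (subject, Avg. pass@1) pairs copied by hand from the table above.
ROWS = [
    ("gpt-4-turbo-2024-04-09+cot (n=3)", 78.85),
    ("gpt-4-turbo-2024-04-09 (n=3)", 68.10),
    ("gpt-4-0613+cot", 76.30),
    ("gpt-4-0613", 69.25),
    ("codellama-13b+cot", 41.70),
    ("codellama-13b", 41.10),
    ("semcoder-6.7b+cot (under verification)", 63.45),  # no plain counterpart
]


def cot_gains(rows):
    """Return {model: cot_score - plain_score} for models listed in both forms."""
    base, cot = {}, {}
    for subject, score in rows:
        # Assumption: a trailing "(...)" is an annotation, not part of the name.
        name = re.sub(r"\s*\(.*\)$", "", subject)
        if name.endswith("+cot"):
            cot[name[: -len("+cot")]] = score
        else:
            base[name] = score
    # Only models that have both a plain and a +cot entry are compared.
    return {m: round(cot[m] - base[m], 2) for m in cot if m in base}


gains = cot_gains(ROWS)
for model, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{model}\t{gain:+.2f}")
```

Subjects that appear only in one form, such as the semcoder entries, are skipped because there is no counterpart to subtract.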