GPT-4 Turbo | BenchmarkList

Metadata

GPT Closed/API

Aliases: gpt-4-turbo, openai-gpt-4-turbo, openai/gpt-4-turbo

Benchmark	Category	Rank	Score	Sampled
LLM Game Benchmark	Agentic	5	0.35	2026-05-06
MLAgentBench	Agentic	3	26.0%	2026-05-27
RewardBench	Alignment	58	83.95	2026-05-06
TextClass Benchmark	Classification	5	1781.47	2026-05-06
Aider Refactoring Benchmark	Coding	10	34.10	2026-05-06
Aider Refactoring Benchmark	Coding	14	21.40	2026-05-06
BigCodeBench	Coding	11	48.20	2026-05-06
BigCodeBench-Hard	Coding	17	29.10	2026-05-05
CRUXEval	Coding	1	78.85	2026-05-05
CRUXEval	Coding	6	68.10	2026-05-05
DS-1000	Coding	1	0.539	2026-05-27
ENAMEL	Coding	2	0.47	2026-05-06
EvalPlus	Coding	10	77.50	2026-05-05
HumanEval+	Coding	6	86.60	2026-05-05
HumanEval+	Coding	11	81.70	2026-05-05
LiveCodeBench	Coding	25	28.70	2026-05-06
MBPP+	Coding	8	73.30	2026-05-05
McEval	Coding	2	63.4%	2026-05-27
SciCode	Coding	231	31.9%	2026-05-11
K-12EduBench	Education	18	55.94	2026-05-27
BenchLM	General Knowledge	98	26	2026-05-06
MixEval Chat	General Knowledge	6	62.60	2026-05-06
HELM AIR-Bench	Generalization	34	0.718739	2026-05-28
HELM Safety	Generalization	13	0.960619	2026-05-28
WeirdML	Generalization	27	18.01	2026-05-06
WildBench	Generalization	4	7.804496578690127	2026-05-27
MedQA	Healthcare	69	81.986%	2026-04-16
RubricEval	Instruction Following	2	3.10	2026-05-06
AIIQ Composite IQ	Intelligence	46	76	2026-05-12
Artificial Analysis Intelligence Index	Intelligence	355	13.72	2026-05-11
HELM Lite	Intelligence	4	0.898402	2026-05-28
HELM Lite	Intelligence	16	0.745371	2026-05-28
Humanity's Last Exam	Intelligence	467	3.3%	2026-05-11
MathVision	Intelligence	101	30.26	2026-05-06
MMLU-Pro	Intelligence	227	69.4%	2026-05-11
SimpleQA	Intelligence	12	24.2%	2026-05-27
TableBench	Intelligence	15	51.5%	2026-05-27
LegalBench	Legal	53	80.462%	2026-05-28
OTIS Mock AIME 2024-2025	Mathematics	32	6.67	2026-05-06
BenchBench	Meta	7	0.91	2026-05-06
DROP	Reasoning	4	0.86	2026-05-06
NPHardEval	Reasoning	1	0.38	2026-05-06
SimpleBench	Reasoning	19	25.10	2026-05-06
ZebraLogic	Reasoning	20	28.40	2026-05-06
ChatRAG Bench	Retrieval	4	54.03	2026-05-06
SciKnowEval	Science	4	4	2026-05-27
DevBench	Software Engineering	1	56.1636	2026-05-27
DevBench	Software Engineering	2	54.1273	2026-05-27
