GPT-5.4 | BenchmarkList

Metadata

GPT Closed/API

Aliases: gpt-5.4, gpt-5.4-20260305, openai-gpt-5.4, openai-gpt-5.4-20260305, openai/gpt-5.4, openai/gpt-5.4-20260305

Benchmark	Category	Rank	Score	Sampled
APEX-Agents-AA	Agentic	2	33.3%	2026-05-11
ARC-AGI-1	Agentic	12	93.67	2026-05-05
ARC-AGI-1	Agentic	15	92.67	2026-05-05
ARC-AGI-1	Agentic	27	86.17	2026-05-05
ARC-AGI-1	Agentic	43	68.17	2026-05-05
ARC-AGI-1	Agentic	4	93.7%	2026-04-23
ARC-AGI-2	Agentic	10	73.95	2026-05-05
ARC-AGI-2	Agentic	17	67.50	2026-05-05
ARC-AGI-2	Agentic	25	55.42	2026-05-05
ARC-AGI-2	Agentic	37	29.17	2026-05-05
ARC-AGI-2	Agentic	5	73.3%	2026-04-23
ARC-AGI-3	Agentic	4	0.21	2026-05-05
AutoBench	Agentic	6	3.13	2026-05-06
AutoLab	Agentic	5	0.56	2026-05-06
BrowseComp	Agentic	5	82.7%	2026-04-23
Claw-Eval-Live	Agentic	2	63.8	2026-05-27
Gert Labs Rankings	Agentic	8	0.62	2026-05-11
HiL-Bench	Agentic	7	9.33%	2026-05-05
Hindsight LLM Memory Leaderboard	Agentic	3	86.80	2026-05-06
ITBench-AA	Agentic	12	34.5%	2026-05-28
ITBench-AA	Agentic	23	18.9%	2026-05-28
LMArena Search Arena	Agentic	13	1200.55	2026-05-06
MCP Atlas	Agentic	5	70.60	2026-05-06
MCP Atlas	Agentic	4	70.6%	2026-04-23
MCP Atlas	Agentic	4	68.1%	2026-04-16
OSWorld-Verified	Agentic	4	0.75	2026-05-06
OSWorld-Verified	Agentic	3	75%	2026-04-23
OSWorld-Verified	Agentic	3	75%	2026-04-16
PinchBench	Agentic	3	0.90	2026-05-06
RuneBench	Agentic	2	4.70	2026-05-05
Tau2-Bench Telecom	Agentic	71	87.1%	2026-05-11
Tau2-Bench Telecom	Agentic	121	74.6%	2026-05-11
Tau2-Bench Telecom	Agentic	214	35.1%	2026-05-11
Tau2-Bench Telecom	Agentic	2	92.8%	2026-04-23
Terminal-Bench Hard	Agentic	3	57.6%	2026-05-11
Terminal-Bench Hard	Agentic	28	43.2%	2026-05-11
Terminal-Bench Hard	Agentic	45	37.9%	2026-05-11
Toolathlon	Agentic	2	0.55	2026-05-06
Toolathlon	Agentic	2	54.6%	2026-04-23
WildClawBench	Agentic	2	50.30	2026-05-06
OpenUGI	Alignment	177	47.01	2026-05-06
OpenUGI	Alignment	323	41.71	2026-05-06
OpenUGI	Alignment	341	41.16	2026-05-06
OpenUGI	Alignment	415	38.96	2026-05-06
OpenUGI	Alignment	622	33.80	2026-05-06
scBench	Biology	3	57.44%	2026-05-27
SpatialBench	Biology	2	57.44%	2026-05-27
ALE-Bench	Coding	3	1607	2026-05-06
ALE-Bench	Coding	5	1520.72	2026-05-06
ALE-Bench	Coding	23	1086.03	2026-05-06
Arena AI Code	Coding	14	1457	2026-05-06
Arena AI Code	Coding	21	1437	2026-05-06
DeepSWE	Coding	2	55.53	2026-05-26
Expert-SWE (Internal)	Coding	2	68.5%	2026-04-23
IOI	Coding	1	67.834%	2026-05-26
LiveCodeBench	Coding	24	84.141%	2026-05-28
LMArena WebDev Arena	Coding	14	1456.78	2026-05-06
LMArena WebDev Arena	Coding	21	1437.09	2026-05-06
SciCode	Coding	2	56.6%	2026-05-11
SciCode	Coding	16	50.3%	2026-05-11
SciCode	Coding	27	47.1%	2026-05-11
SWE Atlas - Codebase QnA	Coding	1	40.80	2026-05-06
SWE Atlas - Codebase QnA	Coding	1	36.30	2026-05-06
SWE Atlas - Refactoring	Coding	1	44.29	2026-05-06
SWE Atlas - Test Writing	Coding	1	44.36	2026-05-06
SWE Atlas - Test Writing	Coding	1	40	2026-05-06
SWE-bench Verified	Coding	7	78.2%	2026-05-28
Terminal-Bench 2.0	Coding	12	58.427%	2026-05-28
Terminal-Bench 2.0	Coding	2	75.1%	2026-04-23
Terminal-Bench 2.0	Coding	2	75.1%	2026-04-16
Vibe Code Bench v1.1	Coding	4	67.421%	2026-05-28
Capture-the-Flags Challenge Tasks (Internal)	Cybersecurity	2	83.7%	2026-04-23
CyberGym	Cybersecurity	2	79%	2026-04-23
CyberGym	Cybersecurity	4	66.3%	2026-04-16
SecCodeBench	Cybersecurity	8	59.74%	2026-05-28
DAXBench	Data	25	83.2%	2026-05-28
OmniDocBench 1.5	Document Understanding	5	0.89	2026-05-06
Arena AI Document	Document AI	8	1480	2026-05-06
OfficeQA Pro	Document AI	2	53.2%	2026-04-23
SAGE	Education	23	43.312%	2026-05-28
AA-Omniscience	Factuality	9	5.65	2026-05-11
Vectara HHEM Hallucination Leaderboard	Factuality	32	93	2026-05-06
CorpFin v2	Finance	17	65.268%	2026-05-28
Finance Agent v1.1	Finance	11	57.152%	2026-05-04
Finance Agent v1.1	Finance	5	56%	2026-04-23
Investment Banking Modeling Tasks (Internal)	Finance	3	87.3%	2026-04-23
MortgageTax	Finance	11	68.323%	2026-05-28
PRBench Finance	Finance	8	45.63	2026-05-06
QuantSightBench	Finance	3	0.7533 coverage	2026-05-28
TaxBench	Finance	13	9.33% mean pass^5	2026-05-27
TaxEval v2	Finance	27	73.958%	2026-05-28
React Native Evals	Frontend Development	4	85.348% overall	2026-05-28
InfiniteBM Chess	Game	6	334.92 Elo / 7 games	2026-05-28
InfiniteBM Coup	Game	1	1690.86 Elo / 21 games	2026-05-28
InfiniteBM Heads-Up No-Limit Hold'em	Game	17	1172.92 Elo / 114 games	2026-05-28
InfiniteBM Heads-Up No-Limit Hold'em	Game	29	1003.42 Elo / 14 games	2026-05-28
InfiniteBM Liar's Dice	Game	24	1165.34 Elo / 117 games	2026-05-28
InfiniteBM Liar's Dice	Game	35	852.51 Elo / 35 games	2026-05-28
InfiniteBM Settlers of Catan	Game	4	1106.18 Elo / 16 games	2026-05-28
InfiniteBM Werewolf	Game	1	2241.79 Elo / 7 games	2026-05-28
InfiniteBM Werewolf	Game	10	901.77 Elo / 11 games	2026-05-28
MageBench Season 1	Game	7	1658 rating / 8 games	2026-05-28
ALL Bench LLM	General Knowledge	23	27.59	2026-05-06
BenchLM	General Knowledge	8	89	2026-05-06
GDPval	Generalization	2	83%	2026-04-23
LMArena Text Arena	Generalization	11	1468.81	2026-05-06
LMArena Text Arena	Generalization	20	1452.22	2026-05-06
MedCode	Healthcare	24	41.292%	2026-05-28
MedQA	Healthcare	5	96.092%	2026-04-16
MedScribe	Healthcare	28	77.549%	2026-05-28
PhysicianBench	Healthcare	4	27.7 +/- 1.5	2026-05-27
HUMAINE	Human Preference	7	3.70	2026-05-06
AIIQ Composite IQ	Intelligence	2	134	2026-05-12
Artificial Analysis Intelligence Index	Intelligence	5	56.8	2026-05-11
Artificial Analysis Intelligence Index	Intelligence	32	47.94	2026-05-11
Artificial Analysis Intelligence Index	Intelligence	107	35.39	2026-05-11
GPQA Diamond	Intelligence	7	91.666%	2026-05-28
Humanity's Last Exam	Intelligence	4	41.6%	2026-05-11
Humanity's Last Exam	Intelligence	27	28.9%	2026-05-11
Humanity's Last Exam	Intelligence	143	10.6%	2026-05-11
Humanity's Last Exam	Intelligence	5	52.1%	2026-04-23
LiveBench	Intelligence	2	80.91	2026-05-05
LiveBench	Intelligence	9	75.60	2026-05-05
MathVision	Intelligence	1	96.10	2026-05-06
MathVision	Intelligence	4	92	2026-05-06
MMLU Pro	Intelligence	13	87.482%	2026-05-28
MMMU Pro	Intelligence	6	87.514%	2026-05-28
CaseLaw v2	Legal	16	63.773%	2026-05-04
LegalBench	Legal	5	86.044%	2026-05-28
Professional Reasoning Bench - Legal	Legal	9	44.35	2026-05-06
Graphwalks BFS >128k	Long Context	4	0.21	2026-05-06
Graphwalks BFS 1M F1	Long Context	3	9.4%	2026-04-23
Graphwalks BFS 256k F1	Long Context	3	62.5%	2026-04-23
Graphwalks parents >128k	Long Context	3	0.32	2026-05-06
Graphwalks Parents 1M F1	Long Context	3	44.4%	2026-04-23
Graphwalks Parents 256k F1	Long Context	3	82.8%	2026-04-23
OpenAI MRCR v2 8-needle 128K-256K	Long Context	2	79.3%	2026-04-23
OpenAI MRCR v2 8-needle 16K-32K	Long Context	1	97.2%	2026-04-23
OpenAI MRCR v2 8-needle 256K-512K	Long Context	2	57.5%	2026-04-23
OpenAI MRCR v2 8-needle 32K-64K	Long Context	1	90.5%	2026-04-23
OpenAI MRCR v2 8-needle 4K-8K	Long Context	2	97.3%	2026-04-23
OpenAI MRCR v2 8-needle 512K-1M	Long Context	2	36.6%	2026-04-23
OpenAI MRCR v2 8-needle 64K-128K	Long Context	1	86%	2026-04-23
OpenAI MRCR v2 8-needle 8K-16K	Long Context	2	91.4%	2026-04-23
AIME	Math	5	96.667%	2026-04-16
LiveMathematicianBench	Math	2	41.8%	2026-05-28
LiveMathematicianBench	Math	3	41.2%	2026-05-28
ProofBench	Math	3	56%	2026-05-28
FrontierMath 2025-02-28 Private	Mathematics	4	47.6%	2026-04-23
FrontierMath Tier 4 2025-07-01 Private	Mathematics	4	27.1%	2026-04-23
Medical Chronology LLM Benchmark	Medical	8	0.89	2026-05-06
Global MMLU	Multilingual	2	90.6%	2026-05-28
ALL Bench Multimodal	Multimodal	33	18.39	2026-05-06
ALL Bench Multimodal	Multimodal	4	30.09	2026-05-06
Blueprint-Bench 2	Multimodal	4	0.664 +/- 0.018	2026-05-28
Design Arena	Multimodal	31	1243	2026-05-06
Design Arena	Multimodal	34	1240	2026-05-06
IDP Leaderboard	Multimodal	2	83.55	2026-05-06
MMMU-Pro	Multimodal	2	82.10	2026-05-06
MMMU-Pro	Multimodal	3	81.20	2026-05-06
MMMU-Pro	Multimodal	2	82.1%	2026-04-23
Visual-Language Understanding	Multimodal	3	50.89	2026-05-06
VTB	Multimodal	1	29.17	2026-05-06
ARC-AGI v2	Reasoning	3	0.73	2026-05-06
CAIS Text Capabilities Index	Reasoning	3	49.3	2026-05-27
Context Arena	Reasoning	11	67.65	2026-05-06
Context Arena	Reasoning	12	66.15	2026-05-06
Context Arena	Reasoning	14	62.89	2026-05-06
Context Arena	Reasoning	16	59.32	2026-05-06
Context Arena	Reasoning	54	26.69	2026-05-06
EnigmaEval	Reasoning	2	15.96	2026-05-06
GPQA Diamond	Reasoning	5	92%	2026-05-11
GPQA Diamond	Reasoning	34	87.1%	2026-05-11
GPQA Diamond	Reasoning	160	74.8%	2026-05-11
GPQA Diamond	Reasoning	5	92.8%	2026-04-23
Graphwalks BFS <128k	Reasoning	2	0.93	2026-05-06
Graphwalks parents <128k	Reasoning	1	0.90	2026-05-06
Humanity's Last Exam (Text Only)	Reasoning	4	36.47	2026-05-06
MultiNRC	Reasoning	3	58.29	2026-05-06
CAIS Risk Index	Safety	10	44.5	2026-05-27
BixBench	Science	2	74%	2026-04-23
CritPt	Science	6	23.4%	2026-05-11
CritPt	Science	26	7.4%	2026-05-11
CritPt	Science	110	0.6%	2026-05-11
GeneBench	Science	4	19%	2026-04-23
ProgramBench	Software Engineering	4	0%	2026-05-05
SWE-bench Pro	Software Engineering	3	57.7%	2026-04-23
SWE-bench Pro	Software Engineering	3	57.7%	2026-04-16
Structured Output Benchmark	Structured Output	1	87	2026-05-06
LiveSQLBench	Text to SQL	8	33.56	2026-05-06
CAIS Vision Capabilities Index	Vision	6	58.0	2026-05-27
Roboflow Vision Evals - Visual Understanding	Vision	5	76.12%	2026-05-22
