GPT-5.4

GPT / OpenAI

192 scores
131 benchmarks
$2.5 / $15 per 1M tokens (cost in/out)

Metadata

GPT Closed/API

Aliases: gpt-5.4, gpt-5.4-20260305, openai-gpt-5.4, openai-gpt-5.4-20260305, openai/gpt-5.4, openai/gpt-5.4-20260305

Benchmark Results

Benchmark | Category | Rank | Score | Sampled
APEX-Agents-AA | Agentic | 2 | 33.3% | 2026-05-11
ARC-AGI-1 | Agentic | 12 | 93.67 | 2026-05-05
ARC-AGI-1 | Agentic | 15 | 92.67 | 2026-05-05
ARC-AGI-1 | Agentic | 27 | 86.17 | 2026-05-05
ARC-AGI-1 | Agentic | 43 | 68.17 | 2026-05-05
ARC-AGI-1 | Agentic | 4 | 93.7% | 2026-04-23
ARC-AGI-2 | Agentic | 10 | 73.95 | 2026-05-05
ARC-AGI-2 | Agentic | 17 | 67.50 | 2026-05-05
ARC-AGI-2 | Agentic | 25 | 55.42 | 2026-05-05
ARC-AGI-2 | Agentic | 37 | 29.17 | 2026-05-05
ARC-AGI-2 | Agentic | 5 | 73.3% | 2026-04-23
ARC-AGI-3 | Agentic | 4 | 0.21 | 2026-05-05
AutoBench | Agentic | 6 | 3.13 | 2026-05-06
AutoLab | Agentic | 5 | 0.56 | 2026-05-06
BrowseComp | Agentic | 5 | 82.7% | 2026-04-23
Claw-Eval-Live | Agentic | 2 | 63.8 | 2026-05-27
Gert Labs Rankings | Agentic | 8 | 0.62 | 2026-05-11
HiL-Bench | Agentic | 7 | 9.33% | 2026-05-05
Hindsight LLM Memory Leaderboard | Agentic | 3 | 86.80 | 2026-05-06
ITBench-AA | Agentic | 12 | 34.5% | 2026-05-28
ITBench-AA | Agentic | 23 | 18.9% | 2026-05-28
LMArena Search Arena | Agentic | 13 | 1200.55 | 2026-05-06
MCP Atlas | Agentic | 5 | 70.60 | 2026-05-06
MCP Atlas | Agentic | 4 | 70.6% | 2026-04-23
MCP Atlas | Agentic | 4 | 68.1% | 2026-04-16
OSWorld-Verified | Agentic | 4 | 0.75 | 2026-05-06
OSWorld-Verified | Agentic | 3 | 75% | 2026-04-23
OSWorld-Verified | Agentic | 3 | 75% | 2026-04-16
PinchBench | Agentic | 3 | 0.90 | 2026-05-06
RuneBench | Agentic | 2 | 4.70 | 2026-05-05
Tau2-Bench Telecom | Agentic | 71 | 87.1% | 2026-05-11
Tau2-Bench Telecom | Agentic | 121 | 74.6% | 2026-05-11
Tau2-Bench Telecom | Agentic | 214 | 35.1% | 2026-05-11
Tau2-Bench Telecom | Agentic | 2 | 92.8% | 2026-04-23
Terminal-Bench Hard | Agentic | 3 | 57.6% | 2026-05-11
Terminal-Bench Hard | Agentic | 28 | 43.2% | 2026-05-11
Terminal-Bench Hard | Agentic | 45 | 37.9% | 2026-05-11
Toolathlon | Agentic | 2 | 0.55 | 2026-05-06
Toolathlon | Agentic | 2 | 54.6% | 2026-04-23
WildClawBench | Agentic | 2 | 50.30 | 2026-05-06
OpenUGI | Alignment | 177 | 47.01 | 2026-05-06
OpenUGI | Alignment | 323 | 41.71 | 2026-05-06
OpenUGI | Alignment | 341 | 41.16 | 2026-05-06
OpenUGI | Alignment | 415 | 38.96 | 2026-05-06
OpenUGI | Alignment | 622 | 33.80 | 2026-05-06
scBench | Biology | 3 | 57.44% | 2026-05-27
SpatialBench | Biology | 2 | 57.44% | 2026-05-27
ALE-Bench | Coding | 3 | 1607 | 2026-05-06
ALE-Bench | Coding | 5 | 1520.72 | 2026-05-06
ALE-Bench | Coding | 23 | 1086.03 | 2026-05-06
Arena AI Code | Coding | 14 | 1457 | 2026-05-06
Arena AI Code | Coding | 21 | 1437 | 2026-05-06
DeepSWE | Coding | 2 | 55.53 | 2026-05-26
Expert-SWE (Internal) | Coding | 2 | 68.5% | 2026-04-23
IOI | Coding | 1 | 67.834% | 2026-05-26
LiveCodeBench | Coding | 24 | 84.141% | 2026-05-28
LMArena WebDev Arena | Coding | 14 | 1456.78 | 2026-05-06
LMArena WebDev Arena | Coding | 21 | 1437.09 | 2026-05-06
SciCode | Coding | 2 | 56.6% | 2026-05-11
SciCode | Coding | 16 | 50.3% | 2026-05-11
SciCode | Coding | 27 | 47.1% | 2026-05-11
SWE Atlas - Codebase QnA | Coding | 1 | 40.80 | 2026-05-06
SWE Atlas - Codebase QnA | Coding | 1 | 36.30 | 2026-05-06
SWE Atlas - Refactoring | Coding | 1 | 44.29 | 2026-05-06
SWE Atlas - Test Writing | Coding | 1 | 44.36 | 2026-05-06
SWE Atlas - Test Writing | Coding | 1 | 40 | 2026-05-06
SWE-bench Verified | Coding | 7 | 78.2% | 2026-05-28
Terminal-Bench 2.0 | Coding | 12 | 58.427% | 2026-05-28
Terminal-Bench 2.0 | Coding | 2 | 75.1% | 2026-04-23
Terminal-Bench 2.0 | Coding | 2 | 75.1% | 2026-04-16
Vibe Code Bench v1.1 | Coding | 4 | 67.421% | 2026-05-28
Capture-the-Flags Challenge Tasks (Internal) | Cybersecurity | 2 | 83.7% | 2026-04-23
CyberGym | Cybersecurity | 2 | 79% | 2026-04-23
CyberGym | Cybersecurity | 4 | 66.3% | 2026-04-16
SecCodeBench | Cybersecurity | 8 | 59.74% | 2026-05-28
DAXBench | Data | 25 | 83.2% | 2026-05-28
OmniDocBench 1.5 | Document Understanding | 5 | 0.89 | 2026-05-06
Arena AI Document | Document AI | 8 | 1480 | 2026-05-06
OfficeQA Pro | Document AI | 2 | 53.2% | 2026-04-23
SAGE | Education | 23 | 43.312% | 2026-05-28
AA-Omniscience | Factuality | 9 | 5.65 | 2026-05-11
Vectara HHEM Hallucination Leaderboard | Factuality | 32 | 93 | 2026-05-06
CorpFin v2 | Finance | 17 | 65.268% | 2026-05-28
Finance Agent v1.1 | Finance | 11 | 57.152% | 2026-05-04
Finance Agent v1.1 | Finance | 5 | 56% | 2026-04-23
Investment Banking Modeling Tasks (Internal) | Finance | 3 | 87.3% | 2026-04-23
MortgageTax | Finance | 11 | 68.323% | 2026-05-28
PRBench Finance | Finance | 8 | 45.63 | 2026-05-06
QuantSightBench | Finance | 3 | 0.7533 coverage | 2026-05-28
TaxBench | Finance | 13 | 9.33% mean pass^5 | 2026-05-27
TaxEval v2 | Finance | 27 | 73.958% | 2026-05-28
React Native Evals | Frontend Development | 4 | 85.348% overall | 2026-05-28
InfiniteBM Chess | Game | 6 | 334.92 Elo / 7 games | 2026-05-28
InfiniteBM Coup | Game | 1 | 1690.86 Elo / 21 games | 2026-05-28
InfiniteBM Heads-Up No-Limit Hold'em | Game | 17 | 1172.92 Elo / 114 games | 2026-05-28
InfiniteBM Heads-Up No-Limit Hold'em | Game | 29 | 1003.42 Elo / 14 games | 2026-05-28
InfiniteBM Liar's Dice | Game | 24 | 1165.34 Elo / 117 games | 2026-05-28
InfiniteBM Liar's Dice | Game | 35 | 852.51 Elo / 35 games | 2026-05-28
InfiniteBM Settlers of Catan | Game | 4 | 1106.18 Elo / 16 games | 2026-05-28
InfiniteBM Werewolf | Game | 1 | 2241.79 Elo / 7 games | 2026-05-28
InfiniteBM Werewolf | Game | 10 | 901.77 Elo / 11 games | 2026-05-28
MageBench Season 1 | Game | 7 | 1658 rating / 8 games | 2026-05-28
ALL Bench LLM | General Knowledge | 23 | 27.59 | 2026-05-06
BenchLM | General Knowledge | 8 | 89 | 2026-05-06
GDPval | Generalization | 2 | 83% | 2026-04-23
LMArena Text Arena | Generalization | 11 | 1468.81 | 2026-05-06
LMArena Text Arena | Generalization | 20 | 1452.22 | 2026-05-06
MedCode | Healthcare | 24 | 41.292% | 2026-05-28
MedQA | Healthcare | 5 | 96.092% | 2026-04-16
MedScribe | Healthcare | 28 | 77.549% | 2026-05-28
PhysicianBench | Healthcare | 4 | 27.7 +/- 1.5 | 2026-05-27
HUMAINE | Human Preference | 7 | 3.70 | 2026-05-06
AIIQ Composite IQ | Intelligence | 2 | 134 | 2026-05-12
Artificial Analysis Intelligence Index | Intelligence | 5 | 56.8 | 2026-05-11
Artificial Analysis Intelligence Index | Intelligence | 32 | 47.94 | 2026-05-11
Artificial Analysis Intelligence Index | Intelligence | 107 | 35.39 | 2026-05-11
GPQA Diamond | Intelligence | 7 | 91.666% | 2026-05-28
Humanity's Last Exam | Intelligence | 4 | 41.6% | 2026-05-11
Humanity's Last Exam | Intelligence | 27 | 28.9% | 2026-05-11
Humanity's Last Exam | Intelligence | 143 | 10.6% | 2026-05-11
Humanity's Last Exam | Intelligence | 5 | 52.1% | 2026-04-23
LiveBench | Intelligence | 2 | 80.91 | 2026-05-05
LiveBench | Intelligence | 9 | 75.60 | 2026-05-05
MathVision | Intelligence | 1 | 96.10 | 2026-05-06
MathVision | Intelligence | 4 | 92 | 2026-05-06
MMLU Pro | Intelligence | 13 | 87.482% | 2026-05-28
MMMU Pro | Intelligence | 6 | 87.514% | 2026-05-28
CaseLaw v2 | Legal | 16 | 63.773% | 2026-05-04
LegalBench | Legal | 5 | 86.044% | 2026-05-28
Professional Reasoning Bench - Legal | Legal | 9 | 44.35 | 2026-05-06
Graphwalks BFS >128k | Long Context | 4 | 0.21 | 2026-05-06
Graphwalks BFS 1M F1 | Long Context | 3 | 9.4% | 2026-04-23
Graphwalks BFS 256k F1 | Long Context | 3 | 62.5% | 2026-04-23
Graphwalks parents >128k | Long Context | 3 | 0.32 | 2026-05-06
Graphwalks Parents 1M F1 | Long Context | 3 | 44.4% | 2026-04-23
Graphwalks Parents 256k F1 | Long Context | 3 | 82.8% | 2026-04-23
OpenAI MRCR v2 8-needle 128K-256K | Long Context | 2 | 79.3% | 2026-04-23
OpenAI MRCR v2 8-needle 16K-32K | Long Context | 1 | 97.2% | 2026-04-23
OpenAI MRCR v2 8-needle 256K-512K | Long Context | 2 | 57.5% | 2026-04-23
OpenAI MRCR v2 8-needle 32K-64K | Long Context | 1 | 90.5% | 2026-04-23
OpenAI MRCR v2 8-needle 4K-8K | Long Context | 2 | 97.3% | 2026-04-23
OpenAI MRCR v2 8-needle 512K-1M | Long Context | 2 | 36.6% | 2026-04-23
OpenAI MRCR v2 8-needle 64K-128K | Long Context | 1 | 86% | 2026-04-23
OpenAI MRCR v2 8-needle 8K-16K | Long Context | 2 | 91.4% | 2026-04-23
AIME | Math | 5 | 96.667% | 2026-04-16
LiveMathematicianBench | Math | 2 | 41.8% | 2026-05-28
LiveMathematicianBench | Math | 3 | 41.2% | 2026-05-28
ProofBench | Math | 3 | 56% | 2026-05-28
FrontierMath 2025-02-28 Private | Mathematics | 4 | 47.6% | 2026-04-23
FrontierMath Tier 4 2025-07-01 Private | Mathematics | 4 | 27.1% | 2026-04-23
Medical Chronology LLM Benchmark | Medical | 8 | 0.89 | 2026-05-06
Global MMLU | Multilingual | 2 | 90.6% | 2026-05-28
ALL Bench Multimodal | Multimodal | 33 | 18.39 | 2026-05-06
ALL Bench Multimodal | Multimodal | 4 | 30.09 | 2026-05-06
Blueprint-Bench 2 | Multimodal | 4 | 0.664 +/- 0.018 | 2026-05-28
Design Arena | Multimodal | 31 | 1243 | 2026-05-06
Design Arena | Multimodal | 34 | 1240 | 2026-05-06
IDP Leaderboard | Multimodal | 2 | 83.55 | 2026-05-06
MMMU-Pro | Multimodal | 2 | 82.10 | 2026-05-06
MMMU-Pro | Multimodal | 3 | 81.20 | 2026-05-06
MMMU-Pro | Multimodal | 2 | 82.1% | 2026-04-23
Visual-Language Understanding | Multimodal | 3 | 50.89 | 2026-05-06
VTB | Multimodal | 1 | 29.17 | 2026-05-06
ARC-AGI v2 | Reasoning | 3 | 0.73 | 2026-05-06
CAIS Text Capabilities Index | Reasoning | 3 | 49.3 | 2026-05-27
Context Arena | Reasoning | 11 | 67.65 | 2026-05-06
Context Arena | Reasoning | 12 | 66.15 | 2026-05-06
Context Arena | Reasoning | 14 | 62.89 | 2026-05-06
Context Arena | Reasoning | 16 | 59.32 | 2026-05-06
Context Arena | Reasoning | 54 | 26.69 | 2026-05-06
EnigmaEval | Reasoning | 2 | 15.96 | 2026-05-06
GPQA Diamond | Reasoning | 5 | 92% | 2026-05-11
GPQA Diamond | Reasoning | 34 | 87.1% | 2026-05-11
GPQA Diamond | Reasoning | 160 | 74.8% | 2026-05-11
GPQA Diamond | Reasoning | 5 | 92.8% | 2026-04-23
Graphwalks BFS <128k | Reasoning | 2 | 0.93 | 2026-05-06
Graphwalks parents <128k | Reasoning | 1 | 0.90 | 2026-05-06
Humanity's Last Exam (Text Only) | Reasoning | 4 | 36.47 | 2026-05-06
MultiNRC | Reasoning | 3 | 58.29 | 2026-05-06
CAIS Risk Index | Safety | 10 | 44.5 | 2026-05-27
BixBench | Science | 2 | 74% | 2026-04-23
CritPt | Science | 6 | 23.4% | 2026-05-11
CritPt | Science | 26 | 7.4% | 2026-05-11
CritPt | Science | 110 | 0.6% | 2026-05-11
GeneBench | Science | 4 | 19% | 2026-04-23
ProgramBench | Software Engineering | 4 | 0% | 2026-05-05
SWE-bench Pro | Software Engineering | 3 | 57.7% | 2026-04-23
SWE-bench Pro | Software Engineering | 3 | 57.7% | 2026-04-16
Structured Output Benchmark | Structured Output | 1 | 87 | 2026-05-06
LiveSQLBench | Text to SQL | 8 | 33.56 | 2026-05-06
CAIS Vision Capabilities Index | Vision | 6 | 58.0 | 2026-05-27
Roboflow Vision Evals - Visual Understanding | Vision | 5 | 76.12% | 2026-05-22