GPT-5.4
GPT / OpenAI
192 scores
131 benchmarks
Cost in/out: $2.5 / $15 per 1M tokens
Metadata
GPT Closed/API
Aliases: gpt-5.4, gpt-5.4-20260305, openai-gpt-5.4, openai-gpt-5.4-20260305, openai/gpt-5.4, openai/gpt-5.4-20260305
| Benchmark | Category | Rank | Score | Sampled |
|---|---|---|---|---|
| APEX-Agents-AA | Agentic | 2 | 33.3% | 2026-05-11 |
| ARC-AGI-1 | Agentic | 12 | 93.67 | 2026-05-05 |
| ARC-AGI-1 | Agentic | 15 | 92.67 | 2026-05-05 |
| ARC-AGI-1 | Agentic | 27 | 86.17 | 2026-05-05 |
| ARC-AGI-1 | Agentic | 43 | 68.17 | 2026-05-05 |
| ARC-AGI-1 | Agentic | 4 | 93.7% | 2026-04-23 |
| ARC-AGI-2 | Agentic | 10 | 73.95 | 2026-05-05 |
| ARC-AGI-2 | Agentic | 17 | 67.50 | 2026-05-05 |
| ARC-AGI-2 | Agentic | 25 | 55.42 | 2026-05-05 |
| ARC-AGI-2 | Agentic | 37 | 29.17 | 2026-05-05 |
| ARC-AGI-2 | Agentic | 5 | 73.3% | 2026-04-23 |
| ARC-AGI-3 | Agentic | 4 | 0.21 | 2026-05-05 |
| AutoBench | Agentic | 6 | 3.13 | 2026-05-06 |
| AutoLab | Agentic | 5 | 0.56 | 2026-05-06 |
| BrowseComp | Agentic | 5 | 82.7% | 2026-04-23 |
| Claw-Eval-Live | Agentic | 2 | 63.8 | 2026-05-27 |
| Gert Labs Rankings | Agentic | 8 | 0.62 | 2026-05-11 |
| HiL-Bench | Agentic | 7 | 9.33% | 2026-05-05 |
| Hindsight LLM Memory Leaderboard | Agentic | 3 | 86.80 | 2026-05-06 |
| ITBench-AA | Agentic | 12 | 34.5% | 2026-05-28 |
| ITBench-AA | Agentic | 23 | 18.9% | 2026-05-28 |
| LMArena Search Arena | Agentic | 13 | 1200.55 | 2026-05-06 |
| MCP Atlas | Agentic | 5 | 70.60 | 2026-05-06 |
| MCP Atlas | Agentic | 4 | 70.6% | 2026-04-23 |
| MCP Atlas | Agentic | 4 | 68.1% | 2026-04-16 |
| OSWorld-Verified | Agentic | 4 | 0.75 | 2026-05-06 |
| OSWorld-Verified | Agentic | 3 | 75% | 2026-04-23 |
| OSWorld-Verified | Agentic | 3 | 75% | 2026-04-16 |
| PinchBench | Agentic | 3 | 0.90 | 2026-05-06 |
| RuneBench | Agentic | 2 | 4.70 | 2026-05-05 |
| Tau2-Bench Telecom | Agentic | 71 | 87.1% | 2026-05-11 |
| Tau2-Bench Telecom | Agentic | 121 | 74.6% | 2026-05-11 |
| Tau2-Bench Telecom | Agentic | 214 | 35.1% | 2026-05-11 |
| Tau2-Bench Telecom | Agentic | 2 | 92.8% | 2026-04-23 |
| Terminal-Bench Hard | Agentic | 3 | 57.6% | 2026-05-11 |
| Terminal-Bench Hard | Agentic | 28 | 43.2% | 2026-05-11 |
| Terminal-Bench Hard | Agentic | 45 | 37.9% | 2026-05-11 |
| Toolathlon | Agentic | 2 | 0.55 | 2026-05-06 |
| Toolathlon | Agentic | 2 | 54.6% | 2026-04-23 |
| WildClawBench | Agentic | 2 | 50.30 | 2026-05-06 |
| OpenUGI | Alignment | 177 | 47.01 | 2026-05-06 |
| OpenUGI | Alignment | 323 | 41.71 | 2026-05-06 |
| OpenUGI | Alignment | 341 | 41.16 | 2026-05-06 |
| OpenUGI | Alignment | 415 | 38.96 | 2026-05-06 |
| OpenUGI | Alignment | 622 | 33.80 | 2026-05-06 |
| scBench | Biology | 3 | 57.44% | 2026-05-27 |
| SpatialBench | Biology | 2 | 57.44% | 2026-05-27 |
| ALE-Bench | Coding | 3 | 1607 | 2026-05-06 |
| ALE-Bench | Coding | 5 | 1520.72 | 2026-05-06 |
| ALE-Bench | Coding | 23 | 1086.03 | 2026-05-06 |
| Arena AI Code | Coding | 14 | 1457 | 2026-05-06 |
| Arena AI Code | Coding | 21 | 1437 | 2026-05-06 |
| DeepSWE | Coding | 2 | 55.53 | 2026-05-26 |
| Expert-SWE (Internal) | Coding | 2 | 68.5% | 2026-04-23 |
| IOI | Coding | 1 | 67.834% | 2026-05-26 |
| LiveCodeBench | Coding | 24 | 84.141% | 2026-05-28 |
| LMArena WebDev Arena | Coding | 14 | 1456.78 | 2026-05-06 |
| LMArena WebDev Arena | Coding | 21 | 1437.09 | 2026-05-06 |
| SciCode | Coding | 2 | 56.6% | 2026-05-11 |
| SciCode | Coding | 16 | 50.3% | 2026-05-11 |
| SciCode | Coding | 27 | 47.1% | 2026-05-11 |
| SWE Atlas - Codebase QnA | Coding | 1 | 40.80 | 2026-05-06 |
| SWE Atlas - Codebase QnA | Coding | 1 | 36.30 | 2026-05-06 |
| SWE Atlas - Refactoring | Coding | 1 | 44.29 | 2026-05-06 |
| SWE Atlas - Test Writing | Coding | 1 | 44.36 | 2026-05-06 |
| SWE Atlas - Test Writing | Coding | 1 | 40 | 2026-05-06 |
| SWE-bench Verified | Coding | 7 | 78.2% | 2026-05-28 |
| Terminal-Bench 2.0 | Coding | 12 | 58.427% | 2026-05-28 |
| Terminal-Bench 2.0 | Coding | 2 | 75.1% | 2026-04-23 |
| Terminal-Bench 2.0 | Coding | 2 | 75.1% | 2026-04-16 |
| Vibe Code Bench v1.1 | Coding | 4 | 67.421% | 2026-05-28 |
| Capture-the-Flags Challenge Tasks (Internal) | Cybersecurity | 2 | 83.7% | 2026-04-23 |
| CyberGym | Cybersecurity | 2 | 79% | 2026-04-23 |
| CyberGym | Cybersecurity | 4 | 66.3% | 2026-04-16 |
| SecCodeBench | Cybersecurity | 8 | 59.74% | 2026-05-28 |
| DAXBench | Data | 25 | 83.2% | 2026-05-28 |
| OmniDocBench 1.5 | Document Understanding | 5 | 0.89 | 2026-05-06 |
| Arena AI Document | Document AI | 8 | 1480 | 2026-05-06 |
| OfficeQA Pro | Document AI | 2 | 53.2% | 2026-04-23 |
| SAGE | Education | 23 | 43.312% | 2026-05-28 |
| AA-Omniscience | Factuality | 9 | 5.65 | 2026-05-11 |
| Vectara HHEM Hallucination Leaderboard | Factuality | 32 | 93 | 2026-05-06 |
| CorpFin v2 | Finance | 17 | 65.268% | 2026-05-28 |
| Finance Agent v1.1 | Finance | 11 | 57.152% | 2026-05-04 |
| Finance Agent v1.1 | Finance | 5 | 56% | 2026-04-23 |
| Investment Banking Modeling Tasks (Internal) | Finance | 3 | 87.3% | 2026-04-23 |
| MortgageTax | Finance | 11 | 68.323% | 2026-05-28 |
| PRBench Finance | Finance | 8 | 45.63 | 2026-05-06 |
| QuantSightBench | Finance | 3 | 0.7533 coverage | 2026-05-28 |
| TaxBench | Finance | 13 | 9.33% mean pass^5 | 2026-05-27 |
| TaxEval v2 | Finance | 27 | 73.958% | 2026-05-28 |
| React Native Evals | Frontend Development | 4 | 85.348% overall | 2026-05-28 |
| InfiniteBM Chess | Game | 6 | 334.92 Elo / 7 games | 2026-05-28 |
| InfiniteBM Coup | Game | 1 | 1690.86 Elo / 21 games | 2026-05-28 |
| InfiniteBM Heads-Up No-Limit Hold'em | Game | 17 | 1172.92 Elo / 114 games | 2026-05-28 |
| InfiniteBM Heads-Up No-Limit Hold'em | Game | 29 | 1003.42 Elo / 14 games | 2026-05-28 |
| InfiniteBM Liar's Dice | Game | 24 | 1165.34 Elo / 117 games | 2026-05-28 |
| InfiniteBM Liar's Dice | Game | 35 | 852.51 Elo / 35 games | 2026-05-28 |
| InfiniteBM Settlers of Catan | Game | 4 | 1106.18 Elo / 16 games | 2026-05-28 |
| InfiniteBM Werewolf | Game | 1 | 2241.79 Elo / 7 games | 2026-05-28 |
| InfiniteBM Werewolf | Game | 10 | 901.77 Elo / 11 games | 2026-05-28 |
| MageBench Season 1 | Game | 7 | 1658 rating / 8 games | 2026-05-28 |
| ALL Bench LLM | General Knowledge | 23 | 27.59 | 2026-05-06 |
| BenchLM | General Knowledge | 8 | 89 | 2026-05-06 |
| GDPval | Generalization | 2 | 83% | 2026-04-23 |
| LMArena Text Arena | Generalization | 11 | 1468.81 | 2026-05-06 |
| LMArena Text Arena | Generalization | 20 | 1452.22 | 2026-05-06 |
| MedCode | Healthcare | 24 | 41.292% | 2026-05-28 |
| MedQA | Healthcare | 5 | 96.092% | 2026-04-16 |
| MedScribe | Healthcare | 28 | 77.549% | 2026-05-28 |
| PhysicianBench | Healthcare | 4 | 27.7 +/- 1.5 | 2026-05-27 |
| HUMAINE | Human Preference | 7 | 3.70 | 2026-05-06 |
| AIIQ Composite IQ | Intelligence | 2 | 134 | 2026-05-12 |
| Artificial Analysis Intelligence Index | Intelligence | 5 | 56.8 | 2026-05-11 |
| Artificial Analysis Intelligence Index | Intelligence | 32 | 47.94 | 2026-05-11 |
| Artificial Analysis Intelligence Index | Intelligence | 107 | 35.39 | 2026-05-11 |
| GPQA Diamond | Intelligence | 7 | 91.666% | 2026-05-28 |
| Humanity's Last Exam | Intelligence | 4 | 41.6% | 2026-05-11 |
| Humanity's Last Exam | Intelligence | 27 | 28.9% | 2026-05-11 |
| Humanity's Last Exam | Intelligence | 143 | 10.6% | 2026-05-11 |
| Humanity's Last Exam | Intelligence | 5 | 52.1% | 2026-04-23 |
| LiveBench | Intelligence | 2 | 80.91 | 2026-05-05 |
| LiveBench | Intelligence | 9 | 75.60 | 2026-05-05 |
| MathVision | Intelligence | 1 | 96.10 | 2026-05-06 |
| MathVision | Intelligence | 4 | 92 | 2026-05-06 |
| MMLU Pro | Intelligence | 13 | 87.482% | 2026-05-28 |
| MMMU Pro | Intelligence | 6 | 87.514% | 2026-05-28 |
| CaseLaw v2 | Legal | 16 | 63.773% | 2026-05-04 |
| LegalBench | Legal | 5 | 86.044% | 2026-05-28 |
| Professional Reasoning Bench - Legal | Legal | 9 | 44.35 | 2026-05-06 |
| Graphwalks BFS >128k | Long Context | 4 | 0.21 | 2026-05-06 |
| Graphwalks BFS 1M F1 | Long Context | 3 | 9.4% | 2026-04-23 |
| Graphwalks BFS 256k F1 | Long Context | 3 | 62.5% | 2026-04-23 |
| Graphwalks parents >128k | Long Context | 3 | 0.32 | 2026-05-06 |
| Graphwalks Parents 1M F1 | Long Context | 3 | 44.4% | 2026-04-23 |
| Graphwalks Parents 256k F1 | Long Context | 3 | 82.8% | 2026-04-23 |
| OpenAI MRCR v2 8-needle 128K-256K | Long Context | 2 | 79.3% | 2026-04-23 |
| OpenAI MRCR v2 8-needle 16K-32K | Long Context | 1 | 97.2% | 2026-04-23 |
| OpenAI MRCR v2 8-needle 256K-512K | Long Context | 2 | 57.5% | 2026-04-23 |
| OpenAI MRCR v2 8-needle 32K-64K | Long Context | 1 | 90.5% | 2026-04-23 |
| OpenAI MRCR v2 8-needle 4K-8K | Long Context | 2 | 97.3% | 2026-04-23 |
| OpenAI MRCR v2 8-needle 512K-1M | Long Context | 2 | 36.6% | 2026-04-23 |
| OpenAI MRCR v2 8-needle 64K-128K | Long Context | 1 | 86% | 2026-04-23 |
| OpenAI MRCR v2 8-needle 8K-16K | Long Context | 2 | 91.4% | 2026-04-23 |
| AIME | Math | 5 | 96.667% | 2026-04-16 |
| LiveMathematicianBench | Math | 2 | 41.8% | 2026-05-28 |
| LiveMathematicianBench | Math | 3 | 41.2% | 2026-05-28 |
| ProofBench | Math | 3 | 56% | 2026-05-28 |
| FrontierMath 2025-02-28 Private | Mathematics | 4 | 47.6% | 2026-04-23 |
| FrontierMath Tier 4 2025-07-01 Private | Mathematics | 4 | 27.1% | 2026-04-23 |
| Medical Chronology LLM Benchmark | Medical | 8 | 0.89 | 2026-05-06 |
| Global MMLU | Multilingual | 2 | 90.6% | 2026-05-28 |
| ALL Bench Multimodal | Multimodal | 33 | 18.39 | 2026-05-06 |
| ALL Bench Multimodal | Multimodal | 4 | 30.09 | 2026-05-06 |
| Blueprint-Bench 2 | Multimodal | 4 | 0.664 +/- 0.018 | 2026-05-28 |
| Design Arena | Multimodal | 31 | 1243 | 2026-05-06 |
| Design Arena | Multimodal | 34 | 1240 | 2026-05-06 |
| IDP Leaderboard | Multimodal | 2 | 83.55 | 2026-05-06 |
| MMMU-Pro | Multimodal | 2 | 82.10 | 2026-05-06 |
| MMMU-Pro | Multimodal | 3 | 81.20 | 2026-05-06 |
| MMMU-Pro | Multimodal | 2 | 82.1% | 2026-04-23 |
| Visual-Language Understanding | Multimodal | 3 | 50.89 | 2026-05-06 |
| VTB | Multimodal | 1 | 29.17 | 2026-05-06 |
| ARC-AGI v2 | Reasoning | 3 | 0.73 | 2026-05-06 |
| CAIS Text Capabilities Index | Reasoning | 3 | 49.3 | 2026-05-27 |
| Context Arena | Reasoning | 11 | 67.65 | 2026-05-06 |
| Context Arena | Reasoning | 12 | 66.15 | 2026-05-06 |
| Context Arena | Reasoning | 14 | 62.89 | 2026-05-06 |
| Context Arena | Reasoning | 16 | 59.32 | 2026-05-06 |
| Context Arena | Reasoning | 54 | 26.69 | 2026-05-06 |
| EnigmaEval | Reasoning | 2 | 15.96 | 2026-05-06 |
| GPQA Diamond | Reasoning | 5 | 92% | 2026-05-11 |
| GPQA Diamond | Reasoning | 34 | 87.1% | 2026-05-11 |
| GPQA Diamond | Reasoning | 160 | 74.8% | 2026-05-11 |
| GPQA Diamond | Reasoning | 5 | 92.8% | 2026-04-23 |
| Graphwalks BFS <128k | Reasoning | 2 | 0.93 | 2026-05-06 |
| Graphwalks parents <128k | Reasoning | 1 | 0.90 | 2026-05-06 |
| Humanity's Last Exam (Text Only) | Reasoning | 4 | 36.47 | 2026-05-06 |
| MultiNRC | Reasoning | 3 | 58.29 | 2026-05-06 |
| CAIS Risk Index | Safety | 10 | 44.5 | 2026-05-27 |
| BixBench | Science | 2 | 74% | 2026-04-23 |
| CritPt | Science | 6 | 23.4% | 2026-05-11 |
| CritPt | Science | 26 | 7.4% | 2026-05-11 |
| CritPt | Science | 110 | 0.6% | 2026-05-11 |
| GeneBench | Science | 4 | 19% | 2026-04-23 |
| ProgramBench | Software Engineering | 4 | 0% | 2026-05-05 |
| SWE-bench Pro | Software Engineering | 3 | 57.7% | 2026-04-23 |
| SWE-bench Pro | Software Engineering | 3 | 57.7% | 2026-04-16 |
| Structured Output Benchmark | Structured Output | 1 | 87 | 2026-05-06 |
| LiveSQLBench | Text to SQL | 8 | 33.56 | 2026-05-06 |
| CAIS Vision Capabilities Index | Vision | 6 | 58.0 | 2026-05-27 |
| Roboflow Vision Evals - Visual Understanding | Vision | 5 | 76.12% | 2026-05-22 |