GPT-4o
GPT / OpenAI
188 scores
121 benchmarks
$2.5 / $10 per 1M tokens (cost in/out)
Metadata
GPT Closed/API
Aliases: gpt-4o, openai-gpt-4o, openai/gpt-4o
| Benchmark | Category | Rank | Score | Sampled |
|---|---|---|---|---|
| AgentIF | Agentic | 2 | 58.5 | 2026-05-27 |
| APEX-Agents | Agentic | 40 | 5.40 | 2026-05-06 |
| GISA | Agentic | 16 | 5.63 | 2026-05-27 |
| LLM Game Benchmark | Agentic | 3 | 0.40 | 2026-05-06 |
| PinchBench | Agentic | 55 | 0.71 | 2026-05-06 |
| RealDataAgentBench | Agentic | 4 | 0.85 | 2026-04-28 |
| Tau2-Bench Telecom | Agentic | 278 | 25.1% | 2026-05-11 |
| Terminal-Bench Hard | Agentic | 220 | 8.3% | 2026-05-11 |
| ToolSandbox | Agentic | 1 | 73 | 2026-05-27 |
| OpenUGI | Alignment | 46 | 54.26 | 2026-05-06 |
| OpenUGI | Alignment | 92 | 50.99 | 2026-05-06 |
| OpenUGI | Alignment | 681 | 32.87 | 2026-05-06 |
| LAB-Bench | Biology | 2 | 0.233333 | 2026-05-27 |
| TextClass Benchmark | Classification | 1 | 1825.22 | 2026-05-06 |
| TextClass Benchmark | Classification | 2 | 1804.54 | 2026-05-06 |
| TextClass Benchmark | Classification | 3 | 1801.72 | 2026-05-06 |
| Aider Refactoring Benchmark | Coding | 5 | 62.90 | 2026-05-06 |
| Aider Refactoring Benchmark | Coding | 7 | 49.40 | 2026-05-06 |
| CadEval | Coding | 9 | 26 | 2026-05-06 |
| CRUXEval | Coding | 4 | 75.80 | 2026-05-05 |
| CRUXEval | Coding | 7 | 67.55 | 2026-05-05 |
| EvalPlus | Coding | 5 | 79.70 | 2026-05-05 |
| HumanEval+ | Coding | 4 | 87.20 | 2026-05-05 |
| Long Code Arena | Coding | 4 | 0.70 | 2026-05-06 |
| MBPP+ | Coding | 11 | 72.20 | 2026-05-05 |
| McEval | Coding | 1 | 65.2% | 2026-05-27 |
| Natural Language to Mongosh | Coding | 10 | 0.86 | 2026-05-06 |
| Natural Language to Mongosh | Coding | 13 | 0.86 | 2026-05-06 |
| Natural Language to Mongosh | Coding | 14 | 0.86 | 2026-05-06 |
| Natural Language to Mongosh | Coding | 35 | 0.83 | 2026-05-06 |
| Natural Language to Mongosh | Coding | 41 | 0.83 | 2026-05-06 |
| Natural Language to Mongosh | Coding | 47 | 0.82 | 2026-05-06 |
| Natural Language to Mongosh | Coding | 52 | 0.82 | 2026-05-06 |
| Natural Language to Mongosh | Coding | 63 | 0.80 | 2026-05-06 |
| Natural Language to Mongosh | Coding | 71 | 0.79 | 2026-05-06 |
| Natural Language to Mongosh | Coding | 73 | 0.79 | 2026-05-06 |
| Natural Language to Mongosh | Coding | 79 | 0.78 | 2026-05-06 |
| SciCode | Coding | 164 | 36.6% | 2026-05-11 |
| SciCode | Coding | 212 | 33.4% | 2026-05-11 |
| SciCode | Coding | 214 | 33.3% | 2026-05-11 |
| AIRTBench | Cybersecurity | 7 | 20.29 | 2026-05-06 |
| LongDocURL | Document Understanding | 1 | 64.5% | 2026-05-27 |
| LongDocURL | Document Understanding | 2 | 50.9% | 2026-05-27 |
| LongDocURL | Document Understanding | 3 | 49.5% | 2026-05-27 |
| LongDocURL | Document Understanding | 4 | 30.6% | 2026-05-27 |
| LongDocURL | Document Understanding | 5 | 25% | 2026-05-27 |
| LongDocURL | Document Understanding | 6 | 16.2% | 2026-05-27 |
| LongDocURL | Document Understanding | 7 | 9.2% | 2026-05-27 |
| MMDocBench | Document Understanding | 1 | 71.99% | 2026-05-27 |
| VAREX-Bench | Document Understanding | 6 | 94.8% EM | 2026-05-28 |
| IB-bench | Domain Specific | 5 | 6 | 2026-05-06 |
| EduGuardBench | Education | 9 | 0.69 | 2026-05-27 |
| MathTutorBench | Education | 2 | 0.6378 | 2026-05-27 |
| TutorBench | Education | 26 | 36.12 | 2026-05-06 |
| RoboBench | Embodied | 8 | 40.16 | 2026-05-27 |
| RoboBench | Embodied | 13 | 30.23 | 2026-05-27 |
| kluster.ai LLM Hallucination Detection Leaderboard | Factuality | 9 | 96.66 | 2026-05-06 |
| BizFinBench | Finance | 3 | 71.8 | 2026-05-27 |
| FinBen | Finance | 7 | -5.54% | 2026-05-27 |
| FinEval | Finance | 8 | 77.65 | 2026-05-27 |
| FinEval | Finance | 15 | 71.9 | 2026-05-27 |
| FinEval | Finance | 21 | 68.5 | 2026-05-27 |
| FinToolBench | Finance | 4 | 0.2302 | 2026-05-27 |
| INVESTORBENCH | Finance | 3 | 39.031% | 2026-05-27 |
| Open FinLLM Leaderboard | Finance | 6 | 37.507713% | 2026-05-27 |
| SECQUE | Finance | 1 | 0.69 | 2026-05-28 |
| BenchLM | General Knowledge | 74 | 43 | 2026-05-06 |
| AgentHarm | Generalization | 21 | 48.4% | 2026-05-27 |
| AgentHarm | Generalization | 25 | 57.7% | 2026-05-27 |
| AgentHarm | Generalization | 34 | 72.7% | 2026-05-27 |
| GDPval | Generalization | 5 | 12.5% | 2025-09-25 |
| HELM AIR-Bench | Generalization | 52 | 0.623463 | 2026-05-28 |
| HELM AIR-Bench | Generalization | 67 | 0.527924 | 2026-05-28 |
| HELM Safety | Generalization | 17 | 0.945905 | 2026-05-28 |
| LongBench v2 | Generalization | 12 | 51.4% | 2026-05-27 |
| LongBench v2 | Generalization | 13 | 51.2% | 2026-05-27 |
| WeirdML | Generalization | 22 | 25.12 | 2026-05-06 |
| WildBench | Generalization | 2 | 7.940371456500489 | 2026-05-27 |
| CHOICE | Geospatial | 9 | 0.6275 | 2026-05-27 |
| GeoCode Leaderboard | Geospatial | 15 | 59.02% pass@1 | 2026-05-28 |
| OmniEarth-Bench | Geospatial | 8 | 11.15 | 2026-05-27 |
| AgentClinic | Healthcare | 5 | 34.2% | 2026-05-27 |
| HealthBench | Healthcare | 4 | 0.3233 | 2026-05-27 |
| HELM MedQA | Healthcare | 6 | 0.876740 | 2026-05-28 |
| MedAgentBench | Healthcare | 2 | 64.00% | 2026-05-27 |
| HUMAINE | Human Preference | 40 | 3.33 | 2026-05-06 |
| AIIQ Composite IQ | Intelligence | 40 | 83 | 2026-05-12 |
| Artificial Analysis Intelligence Index | Intelligence | 266 | 18.56 | 2026-05-11 |
| Artificial Analysis Intelligence Index | Intelligence | 282 | 17.32 | 2026-05-11 |
| Artificial Analysis Intelligence Index | Intelligence | 347 | 14.11 | 2026-05-11 |
| ChartBench | Intelligence | 1 | 64.27 | 2026-05-06 |
| HELM Lite | Intelligence | 1 | 0.959457 | 2026-05-28 |
| Humanity's Last Exam | Intelligence | 303 | 5% | 2026-05-11 |
| Humanity's Last Exam | Intelligence | 443 | 3.7% | 2026-05-11 |
| Humanity's Last Exam | Intelligence | 468 | 3.3% | 2026-05-11 |
| MathVision | Intelligence | 100 | 30.39 | 2026-05-06 |
| MathVista | Intelligence | 23 | 66.10 | 2026-05-06 |
| MathVista | Intelligence | 26 | 63.80 | 2026-05-06 |
| MMLU-Pro | Intelligence | 111 | 80.3% | 2026-05-11 |
| MMLU-Pro | Intelligence | 152 | 77.3% | 2026-05-11 |
| MMLU-Pro | Intelligence | 181 | 74.8% | 2026-05-11 |
| SimpleQA | Intelligence | 8 | 40.1% | 2026-05-27 |
| SimpleQA | Intelligence | 9 | 39% | 2026-05-27 |
| SimpleQA | Intelligence | 10 | 38.8% | 2026-05-27 |
| TableBench | Intelligence | 14 | 51.96% | 2026-05-27 |
| OpenHuEval | Language | 1 | 63.77 | 2026-05-06 |
| LEXam | Legal | 8 | 56.93% open / 53.13% MCQ | 2026-05-28 |
| ConStory-Bench | Long Context | 15 | CED 0.711 | 2026-05-28 |
| AIME 2025 | Math | 196 | 25.7% | 2026-05-11 |
| AIME 2025 | Math | 245 | 6% | 2026-05-11 |
| IneqMath | Math | 38 | 3 | 2026-05-06 |
| OlympiadBench | Math | 1 | 25.89 | 2026-05-06 |
| OlympiadBench | Math | 1 | 39.72 | 2026-05-06 |
| Omni-MATH | Math | 6 | 30.49 | 2026-05-06 |
| FrontierMath 2025-02-28 Private | Mathematics | 22 | 0.34 | 2026-05-06 |
| OTIS Mock AIME 2024-2025 | Mathematics | 33 | 6.39 | 2026-05-06 |
| BRIDGE Medical Leaderboard | Medical | 6 | 52.59 | 2026-05-27 |
| BRIDGE Medical Leaderboard | Medical | 56 | 44.2 | 2026-05-27 |
| BRIDGE Medical Leaderboard | Medical | 90 | 40.66 | 2026-05-27 |
| LiveMedBench | Medical | 34 | 0.0506 | 2026-05-27 |
| MedHELM | Medical | 5 | 0.5696428571428571 | 2026-05-27 |
| AfroBench | Multilingual | 1 | 59.64 | 2026-05-06 |
| AfroBench-Lite | Multilingual | 8 | 65.80 | 2026-05-06 |
| Design Arena | Multimodal | 119 | 919 | 2026-05-06 |
| Math-VR | Multimodal | 30 | 4.3 | 2026-05-27 |
| MMLongBench-Doc | Multimodal | 11 | 46.30 | 2026-05-06 |
| MMMU-Pro | Multimodal | 35 | 51.90 | 2026-05-06 |
| MMSI-Bench | Multimodal | 17 | 30.3% | 2026-05-28 |
| Physical AI Bench Understanding | Multimodal | 16 | 56.20 | 2026-05-06 |
| UniGenBench | Multimodal | 1 | 95.77 | 2026-05-06 |
| UniGenBench | Multimodal | 3 | 92.48 | 2026-05-06 |
| UniGenBench English Long | Multimodal | 1 | 95.41 | 2026-05-06 |
| UniGenBench English Long | Multimodal | 3 | 92.63 | 2026-05-06 |
| UniREditBench | Multimodal | 2 | 73.39 | 2026-05-06 |
| V-STaR | Multimodal | 2 | 26.26 | 2026-05-06 |
| Video SimpleQA | Multimodal | 6 | 49.30 | 2026-05-06 |
| Video-MME | Multimodal | 6 | 77.20 | 2026-05-06 |
| Visual-Language Understanding | Multimodal | 49 | 34.94 | 2026-05-06 |
| VPCT | Multimodal | 7 | 40 | 2026-05-06 |
| Balrog | Reasoning | 6 | 32.30 | 2026-05-06 |
| EnigmaEval | Reasoning | 38 | 0.80 | 2026-05-06 |
| GPQA Diamond | Reasoning | 248 | 65.5% | 2026-05-11 |
| GPQA Diamond | Reasoning | 316 | 54.3% | 2026-05-11 |
| GPQA Diamond | Reasoning | 340 | 51.1% | 2026-05-11 |
| Humanity's Last Exam (Text Only) | Reasoning | 59 | 2.32 | 2026-05-06 |
| LingOly-TOO | Reasoning | 12 | 0.16 | 2026-05-06 |
| MultiNRC | Reasoning | 38 | 12.42 | 2026-05-06 |
| SimpleBench | Reasoning | 25 | 17.80 | 2026-05-06 |
| AgentLeak | Safety | 3 | 77.60 | 2026-05-06 |
| Halluverse-M3 | Safety | 1 | 80.30% | 2026-05-28 |
| ThaiSafetyBench | Safety | 9 | 16.04% overall ASR | 2026-05-28 |
| ChemBench | Science | 16 | 0.61 | 2026-05-06 |
| ChemBench | Science | 27 | 0.51 | 2026-05-06 |
| CritPt | Science | 217 | 0% | 2026-05-11 |
| GSO-Bench | Science | 10 | 0 | 2026-05-06 |
| SciKnowEval | Science | 2 | 2 | 2026-05-27 |
| Defects4J | Software Engineering | 13 | 0.35 | 2026-05-27 |
| Defects4J | Software Engineering | 15 | 0.341 | 2026-05-27 |
| RepairBench | Software Engineering | 13 | 0.326 | 2026-05-27 |
| RepairBench | Software Engineering | 15 | 0.317 | 2026-05-27 |
| SWE-Gym | Software Engineering | 2 | 9.13% | 2026-05-27 |
| SWE-Gym | Software Engineering | 3 | 8.7% | 2026-05-27 |
| SWE-Gym | Software Engineering | 4 | 8.26% | 2026-05-27 |
| SWE-Gym | Software Engineering | 5 | 8.26% | 2026-05-27 |
| SWE-Gym | Software Engineering | 6 | 7.83% | 2026-05-27 |
| SWE-Gym | Software Engineering | 7 | 7.71% | 2026-05-27 |
| SWE-Gym | Software Engineering | 8 | 7.39% | 2026-05-27 |
| SWE-Gym | Software Engineering | 9 | 4.78% | 2026-05-27 |
| SWE-Gym | Software Engineering | 10 | 4.55% | 2026-05-27 |
| SWE-Lancer | Software Engineering | 2 | 8.1% | 2025-07-17 |
| SWE-PRBench | Software Engineering | 5 | 0.113 | 2026-05-27 |
| SWT-Bench | Software Engineering | 12 | 45.5% | 2026-05-27 |
| SWT-Bench | Software Engineering | 14 | 38% | 2026-05-27 |
| SWT-Bench | Software Engineering | 15 | 37.4% | 2026-05-27 |
| SWT-Bench | Software Engineering | 16 | 31.6% | 2026-05-27 |
| SWT-Bench | Software Engineering | 21 | 17.8% | 2026-05-27 |
| SWT-Bench | Software Engineering | 24 | 14.3% | 2026-05-27 |
| VoiceBench | Speech | 4 | 87.8 | 2026-05-27 |
| JSONSchemaBench | Structured Output | 1 | 96.9% schema compliance | 2026-05-28 |
| JSONSchemaBench | Structured Output | 9 | 89.6% schema compliance | 2026-05-28 |
| JSONSchemaBench | Structured Output | 11 | 87.8% schema compliance | 2026-05-28 |
| StructEval | Structured Output | 1 | 76.02% | 2026-05-28 |
| Generate README Eval | Summarization | 6 | 33.13 | 2026-05-06 |
| LiveSQLBench | Text to SQL | 26 | 21.38 | 2026-05-06 |
| VNTL Leaderboard | Translation | 1 | 75.16 | 2026-05-06 |
| VNTL Leaderboard | Translation | 1 | 74.97 | 2026-05-06 |
| CG-Bench | Video | 1 | 39.2% open-ended acc. / 44.9% MCQ long acc. | 2026-05-28 |
| Lech Mazur Writing | Writing | 11 | 8.18 | 2026-05-06 |