GPT-4o

GPT / OpenAI

188 scores
121 benchmarks
$2.5 / $10 per 1M tokens (cost in/out)

Metadata

GPT Closed/API

Aliases: gpt-4o, openai-gpt-4o, openai/gpt-4o

Benchmark Results

Benchmark Category Rank Score Sampled
AgentIF Agentic 2 58.5 2026-05-27
APEX-Agents Agentic 40 5.40 2026-05-06
GISA Agentic 16 5.63 2026-05-27
LLM Game Benchmark Agentic 3 0.40 2026-05-06
PinchBench Agentic 55 0.71 2026-05-06
RealDataAgentBench Agentic 4 0.85 2026-04-28
Tau2-Bench Telecom Agentic 278 25.1% 2026-05-11
Terminal-Bench Hard Agentic 220 8.3% 2026-05-11
ToolSandbox Agentic 1 73 2026-05-27
OpenUGI Alignment 46 54.26 2026-05-06
OpenUGI Alignment 92 50.99 2026-05-06
OpenUGI Alignment 681 32.87 2026-05-06
LAB-Bench Biology 2 0.233333 2026-05-27
TextClass Benchmark Classification 1 1825.22 2026-05-06
TextClass Benchmark Classification 2 1804.54 2026-05-06
TextClass Benchmark Classification 3 1801.72 2026-05-06
Aider Refactoring Benchmark Coding 5 62.90 2026-05-06
Aider Refactoring Benchmark Coding 7 49.40 2026-05-06
CadEval Coding 9 26 2026-05-06
CRUXEval Coding 4 75.80 2026-05-05
CRUXEval Coding 7 67.55 2026-05-05
EvalPlus Coding 5 79.70 2026-05-05
HumanEval+ Coding 4 87.20 2026-05-05
Long Code Arena Coding 4 0.70 2026-05-06
MBPP+ Coding 11 72.20 2026-05-05
McEval Coding 1 65.2% 2026-05-27
Natural Language to Mongosh Coding 10 0.86 2026-05-06
Natural Language to Mongosh Coding 13 0.86 2026-05-06
Natural Language to Mongosh Coding 14 0.86 2026-05-06
Natural Language to Mongosh Coding 35 0.83 2026-05-06
Natural Language to Mongosh Coding 41 0.83 2026-05-06
Natural Language to Mongosh Coding 47 0.82 2026-05-06
Natural Language to Mongosh Coding 52 0.82 2026-05-06
Natural Language to Mongosh Coding 63 0.80 2026-05-06
Natural Language to Mongosh Coding 71 0.79 2026-05-06
Natural Language to Mongosh Coding 73 0.79 2026-05-06
Natural Language to Mongosh Coding 79 0.78 2026-05-06
SciCode Coding 164 36.6% 2026-05-11
SciCode Coding 212 33.4% 2026-05-11
SciCode Coding 214 33.3% 2026-05-11
AIRTBench Cybersecurity 7 20.29 2026-05-06
LongDocURL Document Understanding 1 64.5% 2026-05-27
LongDocURL Document Understanding 2 50.9% 2026-05-27
LongDocURL Document Understanding 3 49.5% 2026-05-27
LongDocURL Document Understanding 4 30.6% 2026-05-27
LongDocURL Document Understanding 5 25% 2026-05-27
LongDocURL Document Understanding 6 16.2% 2026-05-27
LongDocURL Document Understanding 7 9.2% 2026-05-27
MMDocBench Document Understanding 1 71.99% 2026-05-27
VAREX-Bench Document Understanding 6 94.8% EM 2026-05-28
IB-bench Domain Specific 5 6 2026-05-06
EduGuardBench Education 9 0.69 2026-05-27
MathTutorBench Education 2 0.6378 2026-05-27
TutorBench Education 26 36.12 2026-05-06
RoboBench Embodied 8 40.16 2026-05-27
RoboBench Embodied 13 30.23 2026-05-27
kluster.ai LLM Hallucination Detection Leaderboard Factuality 9 96.66 2026-05-06
BizFinBench Finance 3 71.8 2026-05-27
FinBen Finance 7 -5.54% 2026-05-27
FinEval Finance 8 77.65 2026-05-27
FinEval Finance 15 71.9 2026-05-27
FinEval Finance 21 68.5 2026-05-27
FinToolBench Finance 4 0.2302 2026-05-27
INVESTORBENCH Finance 3 39.031% 2026-05-27
Open FinLLM Leaderboard Finance 6 37.507713% 2026-05-27
SECQUE Finance 1 0.69 2026-05-28
BenchLM General Knowledge 74 43 2026-05-06
AgentHarm Generalization 21 48.4% 2026-05-27
AgentHarm Generalization 25 57.7% 2026-05-27
AgentHarm Generalization 34 72.7% 2026-05-27
GDPval Generalization 5 12.5% 2025-09-25
HELM AIR-Bench Generalization 52 0.623463 2026-05-28
HELM AIR-Bench Generalization 67 0.527924 2026-05-28
HELM Safety Generalization 17 0.945905 2026-05-28
LongBench v2 Generalization 12 51.4% 2026-05-27
LongBench v2 Generalization 13 51.2% 2026-05-27
WeirdML Generalization 22 25.12 2026-05-06
WildBench Generalization 2 7.940371456500489 2026-05-27
CHOICE Geospatial 9 0.6275 2026-05-27
GeoCode Leaderboard Geospatial 15 59.02% pass@1 2026-05-28
OmniEarth-Bench Geospatial 8 11.15 2026-05-27
AgentClinic Healthcare 5 34.2% 2026-05-27
HealthBench Healthcare 4 0.3233 2026-05-27
HELM MedQA Healthcare 6 0.876740 2026-05-28
MedAgentBench Healthcare 2 64.00% 2026-05-27
HUMAINE Human Preference 40 3.33 2026-05-06
AIIQ Composite IQ Intelligence 40 83 2026-05-12
Artificial Analysis Intelligence Index Intelligence 266 18.56 2026-05-11
Artificial Analysis Intelligence Index Intelligence 282 17.32 2026-05-11
Artificial Analysis Intelligence Index Intelligence 347 14.11 2026-05-11
ChartBench Intelligence 1 64.27 2026-05-06
HELM Lite Intelligence 1 0.959457 2026-05-28
Humanity's Last Exam Intelligence 303 5% 2026-05-11
Humanity's Last Exam Intelligence 443 3.7% 2026-05-11
Humanity's Last Exam Intelligence 468 3.3% 2026-05-11
MathVision Intelligence 100 30.39 2026-05-06
MathVista Intelligence 23 66.10 2026-05-06
MathVista Intelligence 26 63.80 2026-05-06
MMLU-Pro Intelligence 111 80.3% 2026-05-11
MMLU-Pro Intelligence 152 77.3% 2026-05-11
MMLU-Pro Intelligence 181 74.8% 2026-05-11
SimpleQA Intelligence 8 40.1% 2026-05-27
SimpleQA Intelligence 9 39% 2026-05-27
SimpleQA Intelligence 10 38.8% 2026-05-27
TableBench Intelligence 14 51.96% 2026-05-27
OpenHuEval Language 1 63.77 2026-05-06
LEXam Legal 8 56.93% open / 53.13% MCQ 2026-05-28
ConStory-Bench Long Context 15 CED 0.711 2026-05-28
AIME 2025 Math 196 25.7% 2026-05-11
AIME 2025 Math 245 6% 2026-05-11
IneqMath Math 38 3 2026-05-06
OlympiadBench Math 1 25.89 2026-05-06
OlympiadBench Math 1 39.72 2026-05-06
Omni-MATH Math 6 30.49 2026-05-06
FrontierMath 2025-02-28 Private Mathematics 22 0.34 2026-05-06
OTIS Mock AIME 2024-2025 Mathematics 33 6.39 2026-05-06
BRIDGE Medical Leaderboard Medical 6 52.59 2026-05-27
BRIDGE Medical Leaderboard Medical 56 44.2 2026-05-27
BRIDGE Medical Leaderboard Medical 90 40.66 2026-05-27
LiveMedBench Medical 34 0.0506 2026-05-27
MedHELM Medical 5 0.5696428571428571 2026-05-27
AfroBench Multilingual 1 59.64 2026-05-06
AfroBench-Lite Multilingual 8 65.80 2026-05-06
Design Arena Multimodal 119 919 2026-05-06
Math-VR Multimodal 30 4.3 2026-05-27
MMLongBench-Doc Multimodal 11 46.30 2026-05-06
MMMU-Pro Multimodal 35 51.90 2026-05-06
MMSI-Bench Multimodal 17 30.3% 2026-05-28
Physical AI Bench Understanding Multimodal 16 56.20 2026-05-06
UniGenBench Multimodal 1 95.77 2026-05-06
UniGenBench Multimodal 3 92.48 2026-05-06
UniGenBench English Long Multimodal 1 95.41 2026-05-06
UniGenBench English Long Multimodal 3 92.63 2026-05-06
UniREditBench Multimodal 2 73.39 2026-05-06
V-STaR Multimodal 2 26.26 2026-05-06
Video SimpleQA Multimodal 6 49.30 2026-05-06
Video-MME Multimodal 6 77.20 2026-05-06
Visual-Language Understanding Multimodal 49 34.94 2026-05-06
VPCT Multimodal 7 40 2026-05-06
Balrog Reasoning 6 32.30 2026-05-06
EnigmaEval Reasoning 38 0.80 2026-05-06
GPQA Diamond Reasoning 248 65.5% 2026-05-11
GPQA Diamond Reasoning 316 54.3% 2026-05-11
GPQA Diamond Reasoning 340 51.1% 2026-05-11
Humanity's Last Exam (Text Only) Reasoning 59 2.32 2026-05-06
LingOly-TOO Reasoning 12 0.16 2026-05-06
MultiNRC Reasoning 38 12.42 2026-05-06
SimpleBench Reasoning 25 17.80 2026-05-06
AgentLeak Safety 3 77.60 2026-05-06
Halluverse-M3 Safety 1 80.30% 2026-05-28
ThaiSafetyBench Safety 9 16.04% overall ASR 2026-05-28
ChemBench Science 16 0.61 2026-05-06
ChemBench Science 27 0.51 2026-05-06
CritPt Science 217 0% 2026-05-11
GSO-Bench Science 10 0 2026-05-06
SciKnowEval Science 2 2 2026-05-27
Defects4J Software Engineering 13 0.35 2026-05-27
Defects4J Software Engineering 15 0.341 2026-05-27
RepairBench Software Engineering 13 0.326 2026-05-27
RepairBench Software Engineering 15 0.317 2026-05-27
SWE-Gym Software Engineering 2 9.13% 2026-05-27
SWE-Gym Software Engineering 3 8.7% 2026-05-27
SWE-Gym Software Engineering 4 8.26% 2026-05-27
SWE-Gym Software Engineering 5 8.26% 2026-05-27
SWE-Gym Software Engineering 6 7.83% 2026-05-27
SWE-Gym Software Engineering 7 7.71% 2026-05-27
SWE-Gym Software Engineering 8 7.39% 2026-05-27
SWE-Gym Software Engineering 9 4.78% 2026-05-27
SWE-Gym Software Engineering 10 4.55% 2026-05-27
SWE-Lancer Software Engineering 2 8.1% 2025-07-17
SWE-PRBench Software Engineering 5 0.113 2026-05-27
SWT-Bench Software Engineering 12 45.5% 2026-05-27
SWT-Bench Software Engineering 14 38% 2026-05-27
SWT-Bench Software Engineering 15 37.4% 2026-05-27
SWT-Bench Software Engineering 16 31.6% 2026-05-27
SWT-Bench Software Engineering 21 17.8% 2026-05-27
SWT-Bench Software Engineering 24 14.3% 2026-05-27
VoiceBench Speech 4 87.8 2026-05-27
JSONSchemaBench Structured Output 1 96.9% schema compliance 2026-05-28
JSONSchemaBench Structured Output 9 89.6% schema compliance 2026-05-28
JSONSchemaBench Structured Output 11 87.8% schema compliance 2026-05-28
StructEval Structured Output 1 76.02% 2026-05-28
Generate README Eval Summarization 6 33.13 2026-05-06
LiveSQLBench Text to SQL 26 21.38 2026-05-06
VNTL Leaderboard Translation 1 75.16 2026-05-06
VNTL Leaderboard Translation 1 74.97 2026-05-06
CG-Bench Video 1 39.2% open-ended acc. / 44.9% MCQ long acc. 2026-05-28
Lech Mazur Writing Writing 11 8.18 2026-05-06