o3

o-series / OpenAI

97 scores
81 benchmarks
$2 / $8 per 1M tokens (input / output cost)

Metadata

o-series Closed/API

Aliases: o3, o3-2025-04-16, openai-o3, openai-o3-2025-04-16, openai/o3, openai/o3-2025-04-16

Benchmark Results

Benchmark | Category | Rank | Score | Sampled
ADBench | Agentic | 2 | 82 | 2026-05-06
ALFWorld | Agentic | 7 | 0.883 | 2026-05-27
ALFWorld | Agentic | 9 | 0.817 | 2026-05-27
ALFWorld | Agentic | 10 | 0.7 | 2026-05-27
ARC-AGI-1 | Agentic | 50 | 60.83 | 2026-05-05
ARC-AGI-1 | Agentic | 63 | 53.83 | 2026-05-05
ARC-AGI-1 | Agentic | 74 | 41.50 | 2026-05-05
ARC-AGI-2 | Agentic | 58 | 6.53 | 2026-05-05
ARC-AGI-2 | Agentic | 81 | 2.98 | 2026-05-05
ARC-AGI-2 | Agentic | 93 | 1.99 | 2026-05-05
Berkeley Function-Calling Leaderboard | Agentic | 8 | 63.05% | 2026-05-27
Berkeley Function-Calling Leaderboard | Agentic | 30 | 48.56% | 2026-05-27
DEEPSYNTH | Agentic | 9 | 3.29 | 2026-05-27
MCPMark | Agentic | 18 | 0.25 | 2026-05-06
OSWorld | Agentic | 91 | 23.0% | 2026-05-27
OSWorld | Agentic | 96 | 17.17% | 2026-05-27
OSWorld | Agentic | 100 | 9.1% | 2026-05-27
OSWorld-MCP | Agentic | 10 | 24.10 | 2026-05-06
OSWorld-MCP | Agentic | 11 | 17.60 | 2026-05-06
Tau2 Airline | Agentic | 6 | 0.65 | 2026-05-06
Tau2-Bench Telecom | Agentic | 105 | 80.7% | 2026-05-11
Terminal-Bench Hard | Agentic | 51 | 37.1% | 2026-05-11
VitaBench | Agentic | 5 | 26.30 | 2026-05-06
OpenUGI | Alignment | 72 | 52.09 | 2026-05-06
OpenUGI | Alignment | 167 | 47.45 | 2026-05-06
OpenUGI | Alignment | 183 | 46.52 | 2026-05-06
TextClass Benchmark | Classification | 32 | 1625.36 | 2026-05-06
CadEval | Coding | 1 | 74 | 2026-05-06
LiveCodeBench | Coding | 2 | 75.80 | 2026-05-06
LiveCodeBench | Coding | 26 | 83.914% | 2026-05-28
SciCode | Coding | 79 | 41% | 2026-05-11
MMTU | Data | 2 | 0.69 | 2026-05-06
GSMA Open Telco Leaderboard | Domain | 10 | 69.39 | 2026-05-06
SAGE | Education | 29 | 41.771% | 2026-05-28
From Perception to Action | Embodied AI | 8 | 10.1% | 2026-05-28
CorpFin v2 | Finance | 53 | 59.713% | 2026-05-28
FinanceArena | Finance | 1 | 54.1 | 2026-05-27
MortgageTax | Finance | 26 | 65.7% | 2026-05-28
PRBench Finance | Finance | 6 | 47.69 | 2026-05-06
TaxEval v2 | Finance | 18 | 74.571% | 2026-05-28
MageBench Season 1 | Game | 13 | 1609 rating / 13 games | 2026-05-28
BenchLM | General Knowledge | 53 | 58 | 2026-05-06
Arena-Hard | Generalization | 1 | 85.9% | 2026-05-27
GDPval | Generalization | 3 | 35.2% | 2025-09-25
HELM AIR-Bench | Generalization | 15 | 0.844661 | 2026-05-28
HELM Safety | Generalization | 1 | 0.981606 | 2026-05-28
WeirdML | Generalization | 4 | 58.21 | 2026-05-06
HealthBench | Healthcare | 1 | 0.5990 | 2026-05-27
MedCode | Healthcare | 17 | 47.29% | 2026-05-28
MedQA | Healthcare | 7 | 96.058% | 2026-04-16
MedScribe | Healthcare | 33 | 76.654% | 2026-05-28
HUMAINE | Human Preference | 2 | 3.79 | 2026-05-06
AIIQ Composite IQ | Intelligence | 24 | 110 | 2026-05-12
Artificial Analysis Intelligence Index | Intelligence | 89 | 38.37 | 2026-05-11
GPQA Diamond | Intelligence | 30 | 84.091% | 2026-05-28
Humanity's Last Exam | Intelligence | 69 | 20% | 2026-05-11
MMLU Pro | Intelligence | 32 | 85.595% | 2026-05-28
MMLU-Pro | Intelligence | 29 | 85.3% | 2026-05-11
MMMU Pro | Intelligence | 26 | 80.416% | 2026-05-28
AraGen v3 | Language | 3 | 82.19 | 2026-05-06
HindiGen v1 | Language | 1 | 85.56 | 2026-05-06
LegalBench | Legal | 25 | 83.761% | 2026-05-28
Professional Reasoning Bench - Legal | Legal | 5 | 48.57 | 2026-05-06
Fiction.LiveBench | Long Context | 1 | 100 | 2026-05-06
AIME | Math | 37 | 85.278% | 2026-04-16
AIME 2025 | Math | 35 | 88.3% | 2026-05-11
IneqMath | Math | 6 | 37 | 2026-05-06
IneqMath | Math | 12 | 21 | 2026-05-06
MATH 500 | Math | 9 | 94.6% | 2026-01-09
MGSM | Math | 26 | 91.746% | 2026-01-09
FrontierMath 2025-02-28 Private | Mathematics | 9 | 18.69 | 2026-05-06
FrontierMath Tier 4 2025-07-01 Private | Mathematics | 5 | 4.17 | 2026-05-06
OTIS Mock AIME 2024-2025 | Mathematics | 9 | 83.89 | 2026-05-06
CharXiv-R | Multimodal | 11 | 0.79 | 2026-05-06
MMMU-Pro | Multimodal | 14 | 76.40 | 2026-05-06
MMSI-Bench | Multimodal | 5 | 41% | 2026-05-28
Video SimpleQA | Multimodal | 1 | 66.30 | 2026-05-06
VideoMMMU | Multimodal | 11 | 0.83 | 2026-05-06
Visual-Language Understanding | Multimodal | 6 | 50.07 | 2026-05-06
Visual-Language Understanding | Multimodal | 9 | 49.59 | 2026-05-06
VPCT | Multimodal | 4 | 52 | 2026-05-06
VTB | Multimodal | 7 | 13.74 | 2026-05-06
Artificial Analysis Openness Index | Openness | 233 | 5.56 | 2026-05-11
ARC-AGI v2 | Reasoning | 14 | 0.07 | 2026-05-06
CAIS Text Capabilities Index | Reasoning | 23 | 20.5 | 2026-05-27
EnigmaEval | Reasoning | 5 | 13.09 | 2026-05-06
EnigmaEval | Reasoning | 6 | 11.91 | 2026-05-06
ERQA | Reasoning | 5 | 0.64 | 2026-05-06
GPQA Diamond | Reasoning | 79 | 82.7% | 2026-05-11
Humanity's Last Exam (Text Only) | Reasoning | 12 | 20.57 | 2026-05-06
Humanity's Last Exam (Text Only) | Reasoning | 12 | 19.78 | 2026-05-06
SimpleBench | Reasoning | 6 | 53.10 | 2026-05-06
CritPt | Science | 88 | 1.1% | 2026-05-11
GSO-Bench | Science | 4 | 8.80 | 2026-05-06
LiveSQLBench | Text to SQL | 15 | 29.54 | 2026-05-06
COLLIE | Writing | 3 | 0.98 | 2026-05-06
Lech Mazur Writing | Writing | 4 | 8.63 | 2026-05-06