o3
o-series / OpenAI
97 scores
81 benchmarks
$2 / $8 per 1M tokens (cost in/out)
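As a rough illustration of the listed price ($2 per 1M input tokens, $8 per 1M output tokens), the sketch below estimates the cost of one request. The function name `estimate_cost_usd` and the token counts are hypothetical and chosen only for the example. It assumes flat per-token pricing with no caching or batch discounts, which this page does not confirm.

```python
# Listed prices from this page, in USD per 1M tokens.
INPUT_USD_PER_1M = 2.0
OUTPUT_USD_PER_1M = 8.0


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate request cost in USD (hypothetical helper).

    Assumes flat pricing: no caching, batching, or tiered discounts.
    """
    input_cost = input_tokens * INPUT_USD_PER_1M / 1_000_000
    output_cost = output_tokens * OUTPUT_USD_PER_1M / 1_000_000
    return input_cost + output_cost


# Example: 500k input tokens and 250k output tokens.
print(estimate_cost_usd(500_000, 250_000))
```

Because output tokens cost four times as much as input tokens at these rates, output length tends to dominate the bill for long generations.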
Metadata
o-series Closed/API
Aliases: o3, o3-2025-04-16, openai-o3, openai-o3-2025-04-16, openai/o3, openai/o3-2025-04-16
| Benchmark | Category | Rank | Score | Sampled |
|---|---|---|---|---|
| ADBench | Agentic | 2 | 82 | 2026-05-06 |
| ALFWorld | Agentic | 7 | 0.883 | 2026-05-27 |
| ALFWorld | Agentic | 9 | 0.817 | 2026-05-27 |
| ALFWorld | Agentic | 10 | 0.7 | 2026-05-27 |
| ARC-AGI-1 | Agentic | 50 | 60.83 | 2026-05-05 |
| ARC-AGI-1 | Agentic | 63 | 53.83 | 2026-05-05 |
| ARC-AGI-1 | Agentic | 74 | 41.50 | 2026-05-05 |
| ARC-AGI-2 | Agentic | 58 | 6.53 | 2026-05-05 |
| ARC-AGI-2 | Agentic | 81 | 2.98 | 2026-05-05 |
| ARC-AGI-2 | Agentic | 93 | 1.99 | 2026-05-05 |
| Berkeley Function-Calling Leaderboard | Agentic | 8 | 63.05% | 2026-05-27 |
| Berkeley Function-Calling Leaderboard | Agentic | 30 | 48.56% | 2026-05-27 |
| DEEPSYNTH | Agentic | 9 | 3.29 | 2026-05-27 |
| MCPMark | Agentic | 18 | 0.25 | 2026-05-06 |
| OSWorld | Agentic | 91 | 23.0% | 2026-05-27 |
| OSWorld | Agentic | 96 | 17.17% | 2026-05-27 |
| OSWorld | Agentic | 100 | 9.1% | 2026-05-27 |
| OSWorld-MCP | Agentic | 10 | 24.10 | 2026-05-06 |
| OSWorld-MCP | Agentic | 11 | 17.60 | 2026-05-06 |
| Tau2 Airline | Agentic | 6 | 0.65 | 2026-05-06 |
| Tau2-Bench Telecom | Agentic | 105 | 80.7% | 2026-05-11 |
| Terminal-Bench Hard | Agentic | 51 | 37.1% | 2026-05-11 |
| VitaBench | Agentic | 5 | 26.30 | 2026-05-06 |
| OpenUGI | Alignment | 72 | 52.09 | 2026-05-06 |
| OpenUGI | Alignment | 167 | 47.45 | 2026-05-06 |
| OpenUGI | Alignment | 183 | 46.52 | 2026-05-06 |
| TextClass Benchmark | Classification | 32 | 1625.36 | 2026-05-06 |
| CadEval | Coding | 1 | 74 | 2026-05-06 |
| LiveCodeBench | Coding | 2 | 75.80 | 2026-05-06 |
| LiveCodeBench | Coding | 26 | 83.914% | 2026-05-28 |
| SciCode | Coding | 79 | 41% | 2026-05-11 |
| MMTU | Data | 2 | 0.69 | 2026-05-06 |
| GSMA Open Telco Leaderboard | Domain | 10 | 69.39 | 2026-05-06 |
| SAGE | Education | 29 | 41.771% | 2026-05-28 |
| From Perception to Action | Embodied AI | 8 | 10.1% | 2026-05-28 |
| CorpFin v2 | Finance | 53 | 59.713% | 2026-05-28 |
| FinanceArena | Finance | 1 | 54.1 | 2026-05-27 |
| MortgageTax | Finance | 26 | 65.7% | 2026-05-28 |
| PRBench Finance | Finance | 6 | 47.69 | 2026-05-06 |
| TaxEval v2 | Finance | 18 | 74.571% | 2026-05-28 |
| MageBench Season 1 | Game | 13 | 1609 rating / 13 games | 2026-05-28 |
| BenchLM | General Knowledge | 53 | 58 | 2026-05-06 |
| Arena-Hard | Generalization | 1 | 85.9% | 2026-05-27 |
| GDPval | Generalization | 3 | 35.2% | 2025-09-25 |
| HELM AIR-Bench | Generalization | 15 | 0.844661 | 2026-05-28 |
| HELM Safety | Generalization | 1 | 0.981606 | 2026-05-28 |
| WeirdML | Generalization | 4 | 58.21 | 2026-05-06 |
| HealthBench | Healthcare | 1 | 0.5990 | 2026-05-27 |
| MedCode | Healthcare | 17 | 47.29% | 2026-05-28 |
| MedQA | Healthcare | 7 | 96.058% | 2026-04-16 |
| MedScribe | Healthcare | 33 | 76.654% | 2026-05-28 |
| HUMAINE | Human Preference | 2 | 3.79 | 2026-05-06 |
| AIIQ Composite IQ | Intelligence | 24 | 110 | 2026-05-12 |
| Artificial Analysis Intelligence Index | Intelligence | 89 | 38.37 | 2026-05-11 |
| GPQA Diamond | Intelligence | 30 | 84.091% | 2026-05-28 |
| Humanity's Last Exam | Intelligence | 69 | 20% | 2026-05-11 |
| MMLU Pro | Intelligence | 32 | 85.595% | 2026-05-28 |
| MMLU-Pro | Intelligence | 29 | 85.3% | 2026-05-11 |
| MMMU Pro | Intelligence | 26 | 80.416% | 2026-05-28 |
| AraGen v3 | Language | 3 | 82.19 | 2026-05-06 |
| HindiGen v1 | Language | 1 | 85.56 | 2026-05-06 |
| LegalBench | Legal | 25 | 83.761% | 2026-05-28 |
| Professional Reasoning Bench - Legal | Legal | 5 | 48.57 | 2026-05-06 |
| Fiction.LiveBench | Long Context | 1 | 100 | 2026-05-06 |
| AIME | Math | 37 | 85.278% | 2026-04-16 |
| AIME 2025 | Math | 35 | 88.3% | 2026-05-11 |
| IneqMath | Math | 6 | 37 | 2026-05-06 |
| IneqMath | Math | 12 | 21 | 2026-05-06 |
| MATH 500 | Math | 9 | 94.6% | 2026-01-09 |
| MGSM | Math | 26 | 91.746% | 2026-01-09 |
| FrontierMath 2025-02-28 Private | Mathematics | 9 | 18.69 | 2026-05-06 |
| FrontierMath Tier 4 2025-07-01 Private | Mathematics | 5 | 4.17 | 2026-05-06 |
| OTIS Mock AIME 2024-2025 | Mathematics | 9 | 83.89 | 2026-05-06 |
| CharXiv-R | Multimodal | 11 | 0.79 | 2026-05-06 |
| MMMU-Pro | Multimodal | 14 | 76.40 | 2026-05-06 |
| MMSI-Bench | Multimodal | 5 | 41% | 2026-05-28 |
| Video SimpleQA | Multimodal | 1 | 66.30 | 2026-05-06 |
| VideoMMMU | Multimodal | 11 | 0.83 | 2026-05-06 |
| Visual-Language Understanding | Multimodal | 6 | 50.07 | 2026-05-06 |
| Visual-Language Understanding | Multimodal | 9 | 49.59 | 2026-05-06 |
| VPCT | Multimodal | 4 | 52 | 2026-05-06 |
| VTB | Multimodal | 7 | 13.74 | 2026-05-06 |
| Artificial Analysis Openness Index | Openness | 233 | 5.56 | 2026-05-11 |
| ARC-AGI v2 | Reasoning | 14 | 0.07 | 2026-05-06 |
| CAIS Text Capabilities Index | Reasoning | 23 | 20.5 | 2026-05-27 |
| EnigmaEval | Reasoning | 5 | 13.09 | 2026-05-06 |
| EnigmaEval | Reasoning | 6 | 11.91 | 2026-05-06 |
| ERQA | Reasoning | 5 | 0.64 | 2026-05-06 |
| GPQA Diamond | Reasoning | 79 | 82.7% | 2026-05-11 |
| Humanity's Last Exam (Text Only) | Reasoning | 12 | 20.57 | 2026-05-06 |
| Humanity's Last Exam (Text Only) | Reasoning | 12 | 19.78 | 2026-05-06 |
| SimpleBench | Reasoning | 6 | 53.10 | 2026-05-06 |
| CritPt | Science | 88 | 1.1% | 2026-05-11 |
| GSO-Bench | Science | 4 | 8.80 | 2026-05-06 |
| LiveSQLBench | Text to SQL | 15 | 29.54 | 2026-05-06 |
| COLLIE | Writing | 3 | 0.98 | 2026-05-06 |
| Lech Mazur Writing | Writing | 4 | 8.63 | 2026-05-06 |