GPT-4.1
GPT / OpenAI
107 scores
103 benchmarks
$2 / $8 per 1M tokens (cost in/out)
Metadata
GPT Closed/API
Aliases: gpt-4.1, gpt-4.1-2025-04-14, openai-gpt-4.1, openai-gpt-4.1-2025-04-14, openai/gpt-4.1, openai/gpt-4.1-2025-04-14
| Benchmark | Category | Rank | Score | Sampled |
|---|---|---|---|---|
| ARC-AGI-1 | Agentic | 133 | 5.50 | 2026-05-05 |
| ARC-AGI-2 | Agentic | 126 | 0.42 | 2026-05-05 |
| Berkeley Function-Calling Leaderboard | Agentic | 20 | 53.96% | 2026-05-27 |
| Berkeley Function-Calling Leaderboard | Agentic | 45 | 39.38% | 2026-05-27 |
| CAR-bench | Agentic | 8 | 0.37 | 2026-05-06 |
| DEEPSYNTH | Agentic | 8 | 3.46 | 2026-05-27 |
| Galileo Agent Leaderboard | Agentic | 1 | 0.62 | 2026-05-06 |
| Gert Labs Rankings | Agentic | 58 | 0.28 | 2026-05-11 |
| MCP-Universe | Agentic | 24 | 18.18 | 2026-05-06 |
| MCPMark | Agentic | 33 | 0.08 | 2026-05-06 |
| MultiChallenge | Agentic | 27 | 39.43 | 2026-05-06 |
| RealDataAgentBench | Agentic | 1 | 0.88 | 2026-04-28 |
| Tau2-Bench Telecom | Agentic | 184 | 47.1% | 2026-05-11 |
| Terminal-Bench Hard | Agentic | 186 | 13.6% | 2026-05-11 |
| UAVBench | Agentic | 5 | 79.05 | 2026-05-06 |
| OpenUGI | Alignment | 162 | 47.53 | 2026-05-06 |
| TextClass Benchmark | Classification | 63 | 1520.39 | 2026-05-06 |
| ALE-Bench | Coding | 66 | 558.10 | 2026-05-06 |
| BigCodeBench-Hard | Coding | 7 | 31.80 | 2026-05-05 |
| CadEval | Coding | 6 | 42 | 2026-05-06 |
| LiveCodeBench | Coding | 84 | 54.666% | 2026-05-28 |
| SciCode | Coding | 136 | 38.1% | 2026-05-11 |
| Terminal-Bench 2.0 | Coding | 61 | 14.607% | 2026-05-28 |
| RP-Bench | Creative | 6 | 1522.70 | 2026-05-06 |
| RP-Bench | Creative | 8 | 1509.40 | 2026-05-06 |
| RP-Bench | Creative | 24 | 4.31 | 2026-05-06 |
| GSMA Open Telco Leaderboard | Domain | 23 | 63.39 | 2026-05-06 |
| Vectara HHEM Hallucination Leaderboard | Factuality | 21 | 94.40 | 2026-05-06 |
| CorpFin v2 | Finance | 28 | 63.054% | 2026-05-28 |
| Fin-RATE | Finance | 2 | 33.24% | 2026-05-28 |
| Fin-RATE | Finance | 3 | 31.80% | 2026-05-28 |
| FinChain | Finance | 11 | 56.92 ChainEval | 2026-05-28 |
| MortgageTax | Finance | 24 | 65.938% | 2026-05-28 |
| PRBench Finance | Finance | 24 | 34.32 | 2026-05-06 |
| TaxEval v2 | Finance | 11 | 75.061% | 2026-05-28 |
| BenchLM | General Knowledge | 51 | 58 | 2026-05-06 |
| Arena-Hard | Generalization | 14 | 50.0% | 2026-05-27 |
| HELM AIR-Bench | Generalization | 47 | 0.647875 | 2026-05-28 |
| HELM Safety | Generalization | 11 | 0.962853 | 2026-05-28 |
| WeirdML | Generalization | 16 | 39.37 | 2026-05-06 |
| GeoCode Leaderboard | Geospatial | 3 | 70.93% pass@1 | 2026-05-28 |
| GeoRC | Geospatial | 5 | 42.3 | 2026-05-27 |
| HealthBench | Healthcare | 2 | 0.4778 | 2026-05-27 |
| MedQA | Healthcare | 40 | 91.183% | 2026-04-16 |
| HUMAINE | Human Preference | 24 | 3.53 | 2026-05-06 |
| Multi-IF | Instruction Following | 15 | 0.71 | 2026-05-06 |
| Artificial Analysis Intelligence Index | Intelligence | 180 | 26.28 | 2026-05-11 |
| GPQA Diamond | Intelligence | 75 | 65.404% | 2026-05-28 |
| Humanity's Last Exam | Intelligence | 345 | 4.6% | 2026-05-11 |
| MMLU Pro | Intelligence | 59 | 80.495% | 2026-05-28 |
| MMLU-Pro | Intelligence | 104 | 80.6% | 2026-05-11 |
| MMMU Pro | Intelligence | 45 | 72.386% | 2026-05-28 |
| SimpleQA | Intelligence | 7 | 41.6% | 2026-05-27 |
| AraGen v3 | Language | 9 | 74.54 | 2026-05-06 |
| HellaSwag | Language | 1 | 95.30 | 2026-05-06 |
| HindiGen v1 | Language | 9 | 73.37 | 2026-05-06 |
| WinoGrande | Language | 3 | 87.50 | 2026-05-06 |
| CaseLaw v2 | Legal | 3 | 69.882% | 2026-05-04 |
| LegalBench | Legal | 32 | 83.1% | 2026-05-28 |
| LEXam | Legal | 6 | 57.50% open / 54.40% MCQ | 2026-05-28 |
| Professional Reasoning Bench - Legal | Legal | 23 | 36.48 | 2026-05-06 |
| Graphwalks BFS >128k | Long Context | 5 | 0.19 | 2026-05-06 |
| Graphwalks parents >128k | Long Context | 4 | 0.25 | 2026-05-06 |
| OpenAI-MRCR: 2 needle 128k | Long Context | 4 | 0.57 | 2026-05-06 |
| OpenAI-MRCR: 2 needle 1M | Long Context | 3 | 0.46 | 2026-05-06 |
| Fiction.LiveBench | Long Context | 8 | 63.90 | 2026-05-06 |
| AIME | Math | 70 | 39.583% | 2026-04-16 |
| AIME 2025 | Math | 175 | 34.7% | 2026-05-11 |
| IneqMath | Math | 41 | 2.50 | 2026-05-06 |
| JEEBench | Math | 5 | 0.292 | 2026-05-27 |
| MATH 500 | Math | 33 | 87.2% | 2026-01-09 |
| MGSM | Math | 59 | 87.673% | 2026-01-09 |
| FrontierMath 2025-02-28 Private | Mathematics | 15 | 5.52 | 2026-05-06 |
| FrontierMath Tier 4 2025-07-01 Private | Mathematics | 11 | 0 | 2026-05-06 |
| HMMT 2025 | Mathematics | 32 | 0.29 | 2026-05-06 |
| OTIS Mock AIME 2024-2025 | Mathematics | 21 | 38.33 | 2026-05-06 |
| LiveMedBench | Medical | 14 | 0.1379 | 2026-05-27 |
| MEDIC Benchmark | Medical | 2 | 91.71 average normalized public table score | 2026-05-27 |
| MedSafe-Dx | Medical | 5 | 87.6 | 2026-05-27 |
| AfroBench-Lite | Multilingual | 9 | 65.67 | 2026-05-06 |
| LanguageBench | Multilingual | 6 | 0.66 | 2026-05-06 |
| CharXiv-D | Multimodal | 5 | 0.88 | 2026-05-06 |
| CharXiv-R | Multimodal | 26 | 0.57 | 2026-05-06 |
| Design Arena | Multimodal | 99 | 1084 | 2026-05-06 |
| IDP Leaderboard | Multimodal | 18 | 67.99 | 2026-05-06 |
| Math-VR | Multimodal | 18 | 26.0 | 2026-05-27 |
| MMLongBench-Doc | Multimodal | 10 | 49.70 | 2026-05-06 |
| MMSI-Bench | Multimodal | 13 | 30.9% | 2026-05-28 |
| Visual-Language Understanding | Multimodal | 20 | 45.34 | 2026-05-06 |
| VPCT | Multimodal | 6 | 45 | 2026-05-06 |
| VTB | Multimodal | 11 | 5.52 | 2026-05-06 |
| BBH | Reasoning | 6 | 75.12 | 2026-05-06 |
| EnigmaEval | Reasoning | 26 | 2.17 | 2026-05-06 |
| GPQA Diamond | Reasoning | 236 | 66.6% | 2026-05-11 |
| Graphwalks BFS <128k | Reasoning | 7 | 0.62 | 2026-05-06 |
| Graphwalks parents <128k | Reasoning | 8 | 0.58 | 2026-05-06 |
| Humanity's Last Exam (Text Only) | Reasoning | 45 | 4.97 | 2026-05-06 |
| MultiNRC | Reasoning | 27 | 21.23 | 2026-05-06 |
| SimpleBench | Reasoning | 12 | 34.50 | 2026-05-06 |
| Halluverse-M3 | Safety | 2 | 78.66% | 2026-05-28 |
| CritPt | Science | 213 | 0% | 2026-05-11 |
| Defects4J | Software Engineering | 5 | 0.452 | 2026-05-27 |
| RepairBench | Software Engineering | 6 | 0.413 | 2026-05-27 |
| Structured Output Benchmark | Structured Output | 15 | 85 | 2026-05-06 |
| ComplexFuncBench | Tool Use | 2 | 0.66 | 2026-05-06 |
| COLLIE | Writing | 5 | 0.66 | 2026-05-06 |
| Lech Mazur Writing | Writing | 18 | 7.56 | 2026-05-06 |