GPT-4.1 | BenchmarkList

Metadata

Tags: GPT, Closed/API

Aliases: gpt-4.1, gpt-4.1-2025-04-14, openai-gpt-4.1, openai-gpt-4.1-2025-04-14, openai/gpt-4.1, openai/gpt-4.1-2025-04-14

Benchmark	Category	Rank	Score	Sampled
ARC-AGI-1	Agentic	133	5.50	2026-05-05
ARC-AGI-2	Agentic	126	0.42	2026-05-05
Berkeley Function-Calling Leaderboard	Agentic	20	53.96%	2026-05-27
Berkeley Function-Calling Leaderboard	Agentic	45	39.38%	2026-05-27
CAR-bench	Agentic	8	0.37	2026-05-06
DEEPSYNTH	Agentic	8	3.46	2026-05-27
Galileo Agent Leaderboard	Agentic	1	0.62	2026-05-06
Gert Labs Rankings	Agentic	58	0.28	2026-05-11
MCP-Universe	Agentic	24	18.18	2026-05-06
MCPMark	Agentic	33	0.08	2026-05-06
MultiChallenge	Agentic	27	39.43	2026-05-06
RealDataAgentBench	Agentic	1	0.88	2026-04-28
Tau2-Bench Telecom	Agentic	184	47.1%	2026-05-11
Terminal-Bench Hard	Agentic	186	13.6%	2026-05-11
UAVBench	Agentic	5	79.05	2026-05-06
OpenUGI	Alignment	162	47.53	2026-05-06
TextClass Benchmark	Classification	63	1520.39	2026-05-06
ALE-Bench	Coding	66	558.10	2026-05-06
BigCodeBench-Hard	Coding	7	31.80	2026-05-05
CadEval	Coding	6	42	2026-05-06
LiveCodeBench	Coding	84	54.666%	2026-05-28
SciCode	Coding	136	38.1%	2026-05-11
Terminal-Bench 2.0	Coding	61	14.607%	2026-05-28
RP-Bench	Creative	6	1522.70	2026-05-06
RP-Bench	Creative	8	1509.40	2026-05-06
RP-Bench	Creative	24	4.31	2026-05-06
GSMA Open Telco Leaderboard	Domain	23	63.39	2026-05-06
Vectara HHEM Hallucination Leaderboard	Factuality	21	94.40	2026-05-06
CorpFin v2	Finance	28	63.054%	2026-05-28
Fin-RATE	Finance	2	33.24%	2026-05-28
Fin-RATE	Finance	3	31.80%	2026-05-28
FinChain	Finance	11	56.92 ChainEval	2026-05-28
MortgageTax	Finance	24	65.938%	2026-05-28
PRBench Finance	Finance	24	34.32	2026-05-06
TaxEval v2	Finance	11	75.061%	2026-05-28
BenchLM	General Knowledge	51	58	2026-05-06
Arena-Hard	Generalization	14	50.0%	2026-05-27
HELM AIR-Bench	Generalization	47	0.647875	2026-05-28
HELM Safety	Generalization	11	0.962853	2026-05-28
WeirdML	Generalization	16	39.37	2026-05-06
GeoCode Leaderboard	Geospatial	3	70.93% pass@1	2026-05-28
GeoRC	Geospatial	5	42.3	2026-05-27
HealthBench	Healthcare	2	0.4778	2026-05-27
MedQA	Healthcare	40	91.183%	2026-04-16
HUMAINE	Human Preference	24	3.53	2026-05-06
Multi-IF	Instruction Following	15	0.71	2026-05-06
Artificial Analysis Intelligence Index	Intelligence	180	26.28	2026-05-11
GPQA Diamond	Intelligence	75	65.404%	2026-05-28
Humanity's Last Exam	Intelligence	345	4.6%	2026-05-11
MMLU Pro	Intelligence	59	80.495%	2026-05-28
MMLU-Pro	Intelligence	104	80.6%	2026-05-11
MMMU Pro	Intelligence	45	72.386%	2026-05-28
SimpleQA	Intelligence	7	41.6%	2026-05-27
AraGen v3	Language	9	74.54	2026-05-06
HellaSwag	Language	1	95.30	2026-05-06
HindiGen v1	Language	9	73.37	2026-05-06
WinoGrande	Language	3	87.50	2026-05-06
CaseLaw v2	Legal	3	69.882%	2026-05-04
LegalBench	Legal	32	83.1%	2026-05-28
LEXam	Legal	6	57.50% open / 54.40% MCQ	2026-05-28
Professional Reasoning Bench - Legal	Legal	23	36.48	2026-05-06
Graphwalks BFS >128k	Long Context	5	0.19	2026-05-06
Graphwalks parents >128k	Long Context	4	0.25	2026-05-06
OpenAI-MRCR: 2 needle 128k	Long Context	4	0.57	2026-05-06
OpenAI-MRCR: 2 needle 1M	Long Context	3	0.46	2026-05-06
Fiction.LiveBench	Long Context	8	63.90	2026-05-06
AIME	Math	70	39.583%	2026-04-16
AIME 2025	Math	175	34.7%	2026-05-11
IneqMath	Math	41	2.50	2026-05-06
JEEBench	Math	5	0.292	2026-05-27
MATH 500	Math	33	87.2%	2026-01-09
MGSM	Math	59	87.673%	2026-01-09
FrontierMath 2025-02-28 Private	Mathematics	15	5.52	2026-05-06
FrontierMath Tier 4 2025-07-01 Private	Mathematics	11	0	2026-05-06
HMMT 2025	Mathematics	32	0.29	2026-05-06
OTIS Mock AIME 2024-2025	Mathematics	21	38.33	2026-05-06
LiveMedBench	Medical	14	0.1379	2026-05-27
MEDIC Benchmark	Medical	2	91.71 average normalized public table score	2026-05-27
MedSafe-Dx	Medical	5	87.6	2026-05-27
AfroBench-Lite	Multilingual	9	65.67	2026-05-06
LanguageBench	Multilingual	6	0.66	2026-05-06
CharXiv-D	Multimodal	5	0.88	2026-05-06
CharXiv-R	Multimodal	26	0.57	2026-05-06
Design Arena	Multimodal	99	1084	2026-05-06
IDP Leaderboard	Multimodal	18	67.99	2026-05-06
Math-VR	Multimodal	18	26.0	2026-05-27
MMLongBench-Doc	Multimodal	10	49.70	2026-05-06
MMSI-Bench	Multimodal	13	30.9%	2026-05-28
Visual-Language Understanding	Multimodal	20	45.34	2026-05-06
VPCT	Multimodal	6	45	2026-05-06
VTB	Multimodal	11	5.52	2026-05-06
BBH	Reasoning	6	75.12	2026-05-06
EnigmaEval	Reasoning	26	2.17	2026-05-06
GPQA Diamond	Reasoning	236	66.6%	2026-05-11
Graphwalks BFS <128k	Reasoning	7	0.62	2026-05-06
Graphwalks parents <128k	Reasoning	8	0.58	2026-05-06
Humanity's Last Exam (Text Only)	Reasoning	45	4.97	2026-05-06
MultiNRC	Reasoning	27	21.23	2026-05-06
SimpleBench	Reasoning	12	34.50	2026-05-06
Halluverse-M3	Safety	2	78.66%	2026-05-28
CritPt	Science	213	0%	2026-05-11
Defects4J	Software Engineering	5	0.452	2026-05-27
RepairBench	Software Engineering	6	0.413	2026-05-27
Structured Output Benchmark	Structured Output	15	85	2026-05-06
ComplexFuncBench	Tool Use	2	0.66	2026-05-06
COLLIE	Writing	5	0.66	2026-05-06
Lech Mazur Writing	Writing	18	7.56	2026-05-06