GPT-4.1 Mini | BenchmarkList

Metadata

GPT Closed/API

Aliases: gpt-4.1-mini, gpt-4.1-mini-2025-04-14, openai-gpt-4.1-mini, openai-gpt-4.1-mini-2025-04-14, openai/gpt-4.1-mini, openai/gpt-4.1-mini-2025-04-14

Benchmark	Category	Rank	Score	Sampled
ARC-AGI-1	Agentic	140	3.50	2026-05-05
ARC-AGI-2	Agentic	136	0	2026-05-05
Berkeley Function-Calling Leaderboard	Agentic	27	50.45%	2026-05-27
Berkeley Function-Calling Leaderboard	Agentic	67	29.73%	2026-05-27
Galileo Agent Leaderboard	Agentic	3	0.56	2026-05-06
Hindsight LLM Memory Leaderboard	Agentic	4	86.40	2026-05-06
MCPMark	Agentic	38	0.04	2026-05-06
RealDataAgentBench	Agentic	2	0.87	2026-04-28
Tau2-Bench Telecom	Agentic	172	52.9%	2026-05-11
Terminal-Bench Hard	Agentic	227	7.6%	2026-05-11
UAVBench	Agentic	6	78.10	2026-05-06
TextClass Benchmark	Classification	52	1547.62	2026-05-06
BigCodeBench	Coding	8	48.90	2026-05-06
BigCodeBench-Hard	Coding	8	31.80	2026-05-05
CadEval	Coding	10	16	2026-05-06
LiveCodeBench	Coding	80	58.158%	2026-05-28
SciCode	Coding	90	40.4%	2026-05-11
GSMA Open Telco Leaderboard	Domain	37	58.02	2026-05-06
CorpFin v2	Finance	63	57.926%	2026-05-28
FinanceArena	Finance	12	41.9	2026-05-27
FinChain	Finance	8	57.24 ChainEval	2026-05-28
MortgageTax	Finance	27	65.501%	2026-05-28
PRBench Finance	Finance	27	30.45	2026-05-06
TaxEval v2	Finance	48	71.914%	2026-05-28
BenchLM	General Knowledge	70	46	2026-05-06
Arena-Hard	Generalization	15	46.9%	2026-05-27
HELM AIR-Bench	Generalization	56	0.604408	2026-05-28
HELM Safety	Generalization	15	0.948914	2026-05-28
WeirdML	Generalization	17	37.61	2026-05-06
GeoCode Leaderboard	Geospatial	8	66.56% pass@1	2026-05-28
HealthBench Hard	Healthcare	22	0.4	2026-05-27
MedQA	Healthcare	61	84.633%	2026-04-16
Multi-IF	Instruction Following	17	0.67	2026-05-06
Artificial Analysis Intelligence Index	Intelligence	218	22.9	2026-05-11
GPQA Diamond	Intelligence	73	67.929%	2026-05-28
Humanity's Last Exam	Intelligence	346	4.6%	2026-05-11
MMLU Pro	Intelligence	78	77.225%	2026-05-28
MMLU-Pro	Intelligence	141	78.1%	2026-05-11
MMMU Pro	Intelligence	51	70.537%	2026-05-28
SimpleQA	Intelligence	17	16.8%	2026-05-27
HindiGen v1	Language	16	65.02	2026-05-06
LegalBench	Legal	71	78.044%	2026-05-28
LEXam	Legal	13	54.58% open / 48.49% MCQ	2026-05-28
Professional Reasoning Bench - Legal	Legal	27	30.38	2026-05-06
Graphwalks BFS >128k	Long Context	6	0.15	2026-05-06
Graphwalks parents >128k	Long Context	5	0.11	2026-05-06
OpenAI-MRCR: 2 needle 128k	Long Context	5	0.47	2026-05-06
OpenAI-MRCR: 2 needle 1M	Long Context	4	0.33	2026-05-06
Fiction.LiveBench	Long Context	14	46.90	2026-05-06
AIME	Math	63	49.375%	2026-04-16
AIME 2025	Math	148	46.3%	2026-05-11
MATH 500	Math	32	88%	2026-01-09
MGSM	Math	58	87.782%	2026-01-09
FrontierMath 2025-02-28 Private	Mathematics	16	4.48	2026-05-06
HMMT 2025	Mathematics	30	0.35	2026-05-06
OTIS Mock AIME 2024-2025	Mathematics	20	44.72	2026-05-06
LiveMedBench	Medical	21	0.1036	2026-05-27
MEDIC Benchmark	Medical	35	65.49 average normalized public table score	2026-05-27
LanguageBench	Multilingual	11	0.60	2026-05-06
CharXiv-D	Multimodal	4	0.88	2026-05-06
CharXiv-R	Multimodal	25	0.57	2026-05-06
Design Arena	Multimodal	107	1052	2026-05-06
Math-VR	Multimodal	15	33.3	2026-05-27
Visual-Language Understanding	Multimodal	39	41.14	2026-05-06
GPQA Diamond	Reasoning	238	66.4%	2026-05-11
Graphwalks BFS <128k	Reasoning	7	0.62	2026-05-06
Graphwalks parents <128k	Reasoning	6	0.60	2026-05-06
LiveSecBench	Safety	40	22.99	2026-05-27
CritPt	Science	214	0%	2026-05-11
StructEval	Structured Output	2	75.64%	2026-05-28
ComplexFuncBench	Tool Use	4	0.49	2026-05-06
COLLIE	Writing	8	0.55	2026-05-06
