o1 | BenchmarkList

Metadata

o-series Closed/API

Aliases: o1, o1-2024-12-17, openai-o1, openai-o1-2024-12-17, openai/o1, openai/o1-2024-12-17

Benchmark	Category	Rank	Score	Sampled
Tau2-Bench Telecom	Agentic	157	62.6%	2026-05-11
Terminal-Bench Hard	Agentic	192	12.9%	2026-05-11
OpenUGI	Alignment	299	42.25	2026-05-06
OpenUGI	Alignment	481	37.24	2026-05-06
TextClass Benchmark	Classification	6	1768.81	2026-05-06
BigCodeBench-Hard	Coding	2	32.40	2026-05-05
BigCodeBench-Hard	Coding	13	29.70	2026-05-05
BigCodeBench-Hard	Coding	20	28.40	2026-05-05
CadEval	Coding	4	56	2026-05-06
LiveCodeBench	Coding	87	50.264%	2026-05-28
SciCode	Coding	182	35.8%	2026-05-11
GSMA Open Telco Leaderboard	Domain	13	68.08	2026-05-06
TaxEval v2	Finance	22	74.284%	2026-05-28
BenchLM	General Knowledge	55	58	2026-05-06
Arena-Hard	Generalization	11	55.9%	2026-05-27
HELM AIR-Bench	Generalization	23	0.799614	2026-05-28
HELM Safety	Generalization	4	0.975800	2026-05-28
WeirdML	Generalization	8	47.56	2026-05-06
HealthBench	Healthcare	3	0.4200	2026-05-27
MedQA	Healthcare	1	96.517%	2026-04-16
HUMAINE	Human Preference	30	3.44	2026-05-06
AIIQ Composite IQ	Intelligence	36	91	2026-05-12
Artificial Analysis Intelligence Index	Intelligence	143	30.75	2026-05-11
GPQA Diamond	Intelligence	59	73.232%	2026-05-28
Humanity's Last Exam	Intelligence	196	7.7%	2026-05-11
MathVision	Intelligence	39	60.30	2026-05-06
MathVista	Intelligence	8	73.90	2026-05-06
MMLU Pro	Intelligence	46	83.488%	2026-05-28
MMLU-Pro	Intelligence	42	84.1%	2026-05-11
MMMU Pro	Intelligence	33	77.412%	2026-05-28
SimpleQA	Intelligence	5	42.6%	2026-05-27
SuperGPQA	Intelligence	2	60.24	2026-05-06
AraGen v3	Language	1	84.29	2026-05-06
HindiGen v1	Language	2	79.64	2026-05-06
LegalBench	Legal	54	80.393%	2026-05-28
Fiction.LiveBench	Long Context	11	53.10	2026-05-06
AIME	Math	53	71.458%	2026-04-16
IneqMath	Math	22	8	2026-05-06
IneqMath	Math	23	7.50	2026-05-06
MATH 500	Math	25	90.4%	2026-01-09
MGSM	Math	49	89.309%	2026-01-09
FrontierMath 2025-02-28 Private	Mathematics	11	9.31	2026-05-06
OTIS Mock AIME 2024-2025	Mathematics	15	73.33	2026-05-06
Visual-Language Understanding	Multimodal	23	45.25	2026-05-06
VPCT	Multimodal	10	37	2026-05-06
EnigmaEval	Reasoning	13	5.65	2026-05-06
GPQA Diamond	Reasoning	164	74.7%	2026-05-11
Humanity's Last Exam (Text Only)	Reasoning	34	7.75	2026-05-06
SimpleBench	Reasoning	8	41.70	2026-05-06
ZebraLogic	Reasoning	3	81	2026-05-06
X-Risks Leaderboard	Safety	1	29.09	2026-05-06
CritPt	Science	145	0.3%	2026-05-11
SWE-Lancer	Software Engineering	1	28.4%	2025-07-17
Lech Mazur Writing	Writing	23	7.02	2026-05-06