R1 | BenchmarkList

Metadata

Developer: DeepSeek (open source)

Aliases: deepseek-deepseek-r1, deepseek-r1, deepseek/deepseek-r1

Benchmark	Category	Rank	Score	Sampled
AgentIF	Agentic	5	57.9	2026-05-27
ARC-AGI-1	Agentic	119	15.80	2026-05-05
ARC-AGI-2	Agentic	107	1.30	2026-05-05
LLM-WikiRace	Agentic	7	54.70	2026-05-06
t2-bench	Agentic	11	0.80	2026-05-06
Tau2-Bench Telecom	Agentic	210	36.5%	2026-05-11
Tau2-Bench Telecom	Agentic	366	11.4%	2026-05-11
Terminal-Bench Hard	Agentic	172	15.9%	2026-05-11
Terminal-Bench Hard	Agentic	250	6.1%	2026-05-11
Toolathlon	Agentic	15	0.35	2026-05-06
OpenUGI	Alignment	91	51	2026-05-06
TextClass Benchmark	Classification	16	1718.73	2026-05-06
BigCodeBench-Hard	Coding	14	29.70	2026-05-05
LiveCodeBench	Coding	62	70.221%	2026-05-28
Long Code Arena	Coding	3	0.80	2026-05-06
SciCode	Coding	94	40.3%	2026-05-11
SciCode	Coding	185	35.7%	2026-05-11
TuRTLe Code Completion (Icarus Verilog)	Coding	6	77.00	2026-05-06
TuRTLe Code Completion (Verilator)	Coding	5	75.99	2026-05-06
TuRTLe Spec-to-RTL (Icarus Verilog)	Coding	5	75.53	2026-05-06
TuRTLe Spec-to-RTL (Verilator)	Coding	5	75.78	2026-05-06
IslamicLegalBench	Domain	8	54.21	2026-05-06
EduGuardBench	Education	2	0.75	2026-05-27
K-12EduBench	Education	10	69.13	2026-05-27
Vectara HHEM Hallucination Leaderboard	Factuality	69	88.70	2026-05-06
BizFinBench	Finance	2	73.05	2026-05-27
CorpFin v2	Finance	72	54.118%	2026-05-28
Fin-RATE	Finance	11	15.53%	2026-05-28
FinChain	Finance	17	53.75 ChainEval	2026-05-28
TaxEval v2	Finance	43	72.281%	2026-05-28
Xent Games	Game	4	62.67 overall	2026-05-28
ALL Bench LLM	General Knowledge	12	36.98	2026-05-06
BenchLM	General Knowledge	86	33	2026-05-06
Arena-Hard	Generalization	10	58.0%	2026-05-27
HELM AIR-Bench	Generalization	66	0.529066	2026-05-28
HELM Safety	Generalization	46	0.868314	2026-05-28
HELM Safety	Generalization	47	0.865442	2026-05-28
LongBench v2	Generalization	4	58.3%	2026-05-27
WeirdML	Generalization	18	36.49	2026-05-06
HealthBench Hard	Healthcare	10	0.49	2026-05-27
HELM MedQA	Healthcare	9	0.856859	2026-05-28
MedQA	Healthcare	44	90.8%	2026-04-16
Artificial Analysis Intelligence Index	Intelligence	174	27.07	2026-05-11
Artificial Analysis Intelligence Index	Intelligence	253	18.84	2026-05-11
Humanity's Last Exam	Intelligence	96	14.9%	2026-05-11
Humanity's Last Exam	Intelligence	167	9.3%	2026-05-11
MMLU Pro	Intelligence	47	83.184%	2026-05-28
MMLU-Pro	Intelligence	35	84.9%	2026-05-11
MMLU-Pro	Intelligence	37	84.4%	2026-05-11
SuperGPQA	Intelligence	1	61.82	2026-05-06
OpenHuEval	Language	2	62.31	2026-05-06
J1-ENVS	Legal	13	43.48	2026-05-26
LegalBench	Legal	95	67.323%	2026-05-28
LEXam	Legal	11	55.91% open / 52.41% MCQ	2026-05-28
ConStory-Bench	Long Context	31	CED 3.419	2026-05-28
Fiction.LiveBench	Long Context	21	33.30	2026-05-06
AIME	Math	52	73.958%	2026-04-16
AIME 2025	Math	78	76%	2026-05-11
AIME 2025	Math	101	68%	2026-05-11
IneqMath	Math	30	5	2026-05-06
IneqMath	Math	31	5	2026-05-06
IneqMath	Math	35	3.50	2026-05-06
IneqMath	Math	51	0.50	2026-05-06
MATH 500	Math	18	92.2%	2026-01-09
MGSM	Math	20	92.254%	2026-01-09
HMMT 2025	Mathematics	16	0.90	2026-05-06
OTIS Mock AIME 2024-2025	Mathematics	19	53.33	2026-05-06
BRIDGE Medical Leaderboard	Medical	9	51.38	2026-05-27
BRIDGE Medical Leaderboard	Medical	55	44.25	2026-05-27
BRIDGE Medical Leaderboard	Medical	75	42.1	2026-05-27
LiveMedBench	Medical	17	0.1329	2026-05-27
MedHELM	Medical	1	0.6625	2026-05-27
MEDIC Benchmark	Medical	92	35.5 average normalized public table score	2026-05-27
LanguageBench	Multilingual	28	0.17	2026-05-06
ALL Bench Multimodal	Multimodal	13	35.21	2026-05-06
Math-VR	Multimodal	12	49.5	2026-05-27
Artificial Analysis Openness Index	Openness	44	50	2026-05-11
Balrog	Reasoning	3	34.90	2026-05-06
CAIS Text Capabilities Index	Reasoning	35	8.6	2026-05-27
GPQA Diamond	Reasoning	90	81.3%	2026-05-11
GPQA Diamond	Reasoning	198	70.8%	2026-05-11
Humanity's Last Exam (Text Only)	Reasoning	31	8.54	2026-05-06
LingOly-TOO	Reasoning	9	0.26	2026-05-06
MultiNRC	Reasoning	22	24.27	2026-05-06
SimpleBench	Reasoning	14	30.90	2026-05-06
ZebraLogic	Reasoning	4	78.70	2026-05-06
CAIS Risk Index	Safety	26	57.4	2026-05-27
CritPt	Science	65	1.4%	2026-05-11
CritPt	Science	106	0.6%	2026-05-11
BrowseComp-zh	Search	6	0.65	2026-05-06
Defects4J	Software Engineering	4	0.475	2026-05-27
RepairBench	Software Engineering	3	0.452	2026-05-27
LiveSQLBench	Text to SQL	18	26.90	2026-05-06
Lech Mazur Writing	Writing	10	8.30	2026-05-06
