DeepSeek V3 | BenchmarkList

Metadata

Developer: DeepSeek. Open source.

Aliases: deepseek-chat, deepseek-chat-v3, deepseek-deepseek-chat, deepseek-deepseek-chat-v3, deepseek/deepseek-chat, deepseek/deepseek-chat-v3

Benchmark	Category	Rank	Score	Date sampled
ADBench	Agentic	6	80	2026-05-06
AgentIF	Agentic	7	56.7	2026-05-27
Galileo Agent Leaderboard	Agentic	12	0.40	2026-05-06
MCP-Universe	Agentic	28	14.29	2026-05-06
MCPMark	Agentic	27	0.17	2026-05-06
PinchBench	Agentic	54	0.72	2026-05-06
Tau2-Bench Telecom	Agentic	295	22.8%	2026-05-11
Terminal-Bench Hard	Agentic	233	6.8%	2026-05-11
AgentBench FC	Agents	23	36.10	2026-05-06
TextClass Benchmark	Classification	13	1732.54	2026-05-06
BigCodeBench	Coding	2	50	2026-05-06
BigCodeBench-Hard	Coding	21	28.40	2026-05-05
EvalPlus	Coding	4	79.80	2026-05-05
HumanEval-Mul	Coding	1	0.83	2026-05-06
HumanEval+	Coding	5	86.60	2026-05-05
LiveCodeBench	Coding	27	27.20	2026-05-06
MBPP+	Coding	10	73	2026-05-05
SciCode	Coding	188	35.4%	2026-05-11
EduGuardBench	Education	4	0.73	2026-05-27
K-12EduBench	Education	2	79.67	2026-05-27
BizFinBench	Finance	4	71.57	2026-05-27
CorpFin v2	Finance	77	52.486%	2026-05-28
Fin-RATE	Finance	14	9.81%	2026-05-28
Open FinLLM Leaderboard	Finance	9	29.494986%	2026-05-27
TaxEval v2	Finance	76	67.907%	2026-05-28
Xent Games	Game	11	35.48 overall	2026-05-28
BenchLM	General Knowledge	82	36	2026-05-06
CSimpleQA	General Knowledge	7	0.65	2026-05-06
MMLU-Redux	General Knowledge	24	0.89	2026-05-06
HELM AIR-Bench	Generalization	80	0.407885	2026-05-28
HELM Safety	Generalization	45	0.871772	2026-05-28
WeirdML	Generalization	12	41.63	2026-05-06
MedAgentBench	Healthcare	3	62.67%	2026-05-27
MedQA	Healthcare	71	80.9%	2026-04-16
Artificial Analysis Intelligence Index	Intelligence	293	16.46	2026-05-11
GPQA Diamond	Intelligence	89	54.546%	2026-05-28
Humanity's Last Exam	Intelligence	450	3.6%	2026-05-11
MMLU Pro	Intelligence	87	73.82%	2026-05-28
MMLU-Pro	Intelligence	173	75.2%	2026-05-11
HellaSwag	Language	4	88.90	2026-05-06
OpenHuEval	Language	3	57.10	2026-05-06
PIQA	Language	6	84.70	2026-05-06
WinoGrande	Language	5	86.30	2026-05-06
LegalBench	Legal	51	80.762%	2026-05-28
LEXam	Legal	14	52.53% open / 46.57% MCQ	2026-05-28
ConStory-Bench	Long Context	28	CED 2.422	2026-05-28
Fiction.LiveBench	Long Context	12	53.10	2026-05-06
AIME	Math	74	27.5%	2026-04-16
AIME 2025	Math	195	26%	2026-05-11
MATH 500	Math	38	80.4%	2026-01-09
MGSM	Math	23	92.146%	2026-01-09
CNMO 2024	Mathematics	3	0.43	2026-05-06
FrontierMath 2025-02-28 Private	Mathematics	5	22.10	2026-05-06
FrontierMath Tier 4 2025-07-01 Private	Mathematics	7	2.10	2026-05-06
MATH-500	Mathematics	27	0.90	2026-05-06
OTIS Mock AIME 2024-2025	Mathematics	4	87.82	2026-05-06
LanguageBench	Multilingual	8	0.64	2026-05-06
Design Arena	Multimodal	77	1166	2026-05-06
Balrog	Reasoning	11	19.50	2026-05-06
BBH	Reasoning	1	87.50	2026-05-06
CLUEWSC	Reasoning	2	0.91	2026-05-06
DROP	Reasoning	1	0.92	2026-05-06
GPQA Diamond	Reasoning	310	55.7%	2026-05-11
Humanity's Last Exam (Text Only)	Reasoning	47	4.55	2026-05-06
SimpleBench	Reasoning	9	40.80	2026-05-06
ZebraLogic	Reasoning	11	42.10	2026-05-06
CritPt	Science	169	0%	2026-05-11
SciPredict	Science	6	19.18	2026-05-06
FRAMES	Search	2	0.73	2026-05-06
Defects4J	Software Engineering	11	0.399	2026-05-27
RepairBench	Software Engineering	11	0.371	2026-05-27
SWE-PRBench	Software Engineering	3	0.15	2026-05-27
LiveSQLBench	Text to SQL	24	23.68	2026-05-06
Lech Mazur Writing	Writing	7	8.52	2026-05-06
