Claude 3.7 Sonnet | BenchmarkList

Metadata

Claude Closed/API

Aliases: anthropic-claude-3-7-sonnet-20250219, anthropic-claude-3.7-sonnet, anthropic/claude-3-7-sonnet-20250219, anthropic/claude-3.7-sonnet, claude-3-7-sonnet-20250219, claude-3.7-sonnet

Benchmark Results

Benchmark	Category	Rank	Score	Date Sampled
ALFWorld	Agentic	8	0.833	2026-05-27
MCP-Universe	Agentic	14	24.24	2026-05-06
OSWorld	Agentic	62	35.8%	2026-05-27
OSWorld	Agentic	63	35.6%	2026-05-27
OSWorld	Agentic	80	27.1%	2026-05-27
Tau2-Bench Telecom	Agentic	177	50%	2026-05-11
Terminal-Bench Hard	Agentic	138	21.2%	2026-05-11
WildAgtEval	Agentic	5	61.6%	2026-05-28
OpenUGI	Alignment	515	36.38	2026-05-06
OpenUGI	Alignment	675	33.02	2026-05-06
TextClass Benchmark	Classification	69	1500.76	2026-05-06
BigCodeBench-Hard	Coding	4	32.40	2026-05-05
BigCodeBench-Hard	Coding	5	31.80	2026-05-05
CadEval	Coding	5	54	2026-05-06
LiveCodeBench	Coding	82	56.662%	2026-05-28
Natural Language to Mongosh	Coding	2	0.89	2026-05-06
Natural Language to Mongosh	Coding	3	0.88	2026-05-06
Natural Language to Mongosh	Coding	4	0.87	2026-05-06
Natural Language to Mongosh	Coding	5	0.87	2026-05-06
Natural Language to Mongosh	Coding	6	0.87	2026-05-06
Natural Language to Mongosh	Coding	8	0.86	2026-05-06
Natural Language to Mongosh	Coding	9	0.86	2026-05-06
Natural Language to Mongosh	Coding	15	0.86	2026-05-06
Natural Language to Mongosh	Coding	16	0.86	2026-05-06
Natural Language to Mongosh	Coding	22	0.85	2026-05-06
Natural Language to Mongosh	Coding	28	0.84	2026-05-06
SciCode	Coding	142	37.6%	2026-05-11
AIRTBench	Cybersecurity	1	46.86	2026-05-06
GSMA Open Telco Leaderboard	Domain	17	65.56	2026-05-06
K-12EduBench	Education	17	61.20	2026-05-27
RoboBench	Embodied	6	40.53	2026-05-27
FinEval	Finance	29	62.9	2026-05-27
MortgageTax	Finance	8	68.68%	2026-05-28
TaxEval v2	Finance	40	72.404%	2026-05-28
HELM AIR-Bench	Generalization	21	0.817703	2026-05-28
HELM Safety	Generalization	18	0.944914	2026-05-28
WeirdML	Generalization	15	39.97	2026-05-06
GeoCode Leaderboard	Geospatial	4	70.35% pass@1	2026-05-28
OmniEarth-Bench	Geospatial	4	29.07	2026-05-27
HELM MedQA	Healthcare	8	0.856859	2026-05-28
HUMAINE	Human Preference	31	3.40	2026-05-06
Artificial Analysis Intelligence Index	Intelligence	142	30.81	2026-05-11
GPQA Diamond	Intelligence	74	67.424%	2026-05-28
Humanity's Last Exam	Intelligence	322	4.8%	2026-05-11
MathVision	Intelligence	43	58.60	2026-05-06
MMLU Pro	Intelligence	57	80.663%	2026-05-28
MMLU-Pro	Intelligence	110	80.3%	2026-05-11
MMMU Pro	Intelligence	48	71.519%	2026-05-28
AraGen v3	Language	7	78.16	2026-05-06
HindiGen v1	Language	12	70.77	2026-05-06
WinoGrande	Language	17	75.10	2026-05-06
LegalBench	Legal	60	80.001%	2026-05-28
LEXam	Legal	3	62.86% open / 57.23% MCQ	2026-05-28
Fiction.LiveBench	Long Context	13	53.10	2026-05-06
AIME	Math	79	22.292%	2026-04-16
AIME 2025	Math	208	21%	2026-05-11
IneqMath	Math	45	2	2026-05-06
IneqMath	Math	50	1	2026-05-06
MATH 500	Math	43	76.8%	2026-01-09
MGSM	Math	19	92.4%	2026-01-09
FrontierMath 2025-02-28 Private	Mathematics	17	4.14	2026-05-06
FrontierMath Tier 4 2025-07-01 Private	Mathematics	12	0	2026-05-06
MATH-500	Mathematics	14	0.96	2026-05-06
OTIS Mock AIME 2024-2025	Mathematics	18	57.78	2026-05-06
LiveMedBench	Medical	11	0.1699	2026-05-27
MedHELM	Medical	3	0.6357142857142857	2026-05-27
AfroBench-Lite	Multilingual	11	60.26	2026-05-06
LanguageBench	Multilingual	3	0.68	2026-05-06
Design Arena	Multimodal	37	1235	2026-05-06
Video SimpleQA	Multimodal	9	36.20	2026-05-06
Visual-Language Understanding	Multimodal	34	43.02	2026-05-06
VPCT	Multimodal	9	39	2026-05-06
Balrog	Reasoning	5	32.60	2026-05-06
EnigmaEval	Reasoning	25	2.26	2026-05-06
GPQA Diamond	Reasoning	245	65.6%	2026-05-11
LingOly-TOO	Reasoning	3	0.43	2026-05-06
SimpleBench	Reasoning	7	46.40	2026-05-06
CritPt	Science	160	0%	2026-05-11
GSO-Bench	Science	7	4.60	2026-05-06
Defects4J	Software Engineering	3	0.478	2026-05-27
RepairBench	Software Engineering	4	0.44	2026-05-27
LiveSQLBench	Text to SQL	21	25.75	2026-05-06
Lech Mazur Writing	Writing	13	8.11	2026-05-06
