BenchmarkList

Recently added benchmarks

May 28, 2026

146 benchmarks

2 models

AAV Capsid Packaging Prediction Open

Dyno Therapeutics AAV capsid packaging prediction evaluation reported in Anthropic's Claude Opus 4.8 system card.

Biology | Anthropic System Card | No corpus, no PLM AUROC | Added May 28, 2026

#1 Claude Mythos Preview 0.8
#2 Claude Opus 4.8 0.8

6 models

AgentQuest Open

Evaluates autonomous agent performance on multi-step tasks requiring planning, state tracking, tool use, and recovery.

Agentic | AgentQuest arXiv LaTeX Table | Success Rate | Added May 28, 2026 | Published Apr 9, 2024

#1 Modified LangChain chat agent + GPT-4 on ALFWorld 93%
#2 LangChain chat agent + GPT-4 on ALFWorld 86%
#3 Modified LangChain chat agent + GPT-4 on Mastermind 60%

9 models

AHa-Bench Open

Audio hallucination benchmark for large audio-language models across semantic, acoustic, and confusion hallucination types.

Audio | AHa-Bench OpenReview Table 2 | Hallucination Accuracy | Added May 28, 2026

#1 Gemini-2.5-Pro 60%
#2 GPT-Audio 28.75%
#3 Kimi-Audio 23.94%

3 models

ArxivMath Open

Final-answer research-math slice of MathArena's ArxivMath benchmark, drawn from the March and April 2026 releases, as reported in Anthropic's Claude Opus 4.8 system card.

Mathematics | Anthropic System Card | Score | Added May 28, 2026

#1 Claude Opus 4.8 71.8%
#2 GPT-5.5 71.5%
#3 Gemini 3.1 Pro Preview 64.8%

30 models

Audio Reasoning Challenge Open

Interspeech 2026 challenge benchmark for audio reasoning systems, with public Single Model and Agent track leaderboards scored by rubric and accuracy.

Audio | Audio Reasoning Challenge Official Leaderboard | Rubrics | Added May 28, 2026

#1 TalTech (Agent Track) 69.83 rubrics / 76.9% accuracy
#2 AISpeech (Agent Track) 66.23 rubrics / 77.4% accuracy
#3 AI^2 (Agent Track) 66.09 rubrics / 75.1% accuracy

4 models

BioPipelineBench Verified Open

Verified BioPipelineBench slice for bioinformatics pipeline tasks, reported in Anthropic's Claude Opus 4.8 system card.

Biology | Anthropic System Card | Accuracy | Added May 28, 2026

#1 Claude Mythos Preview 88.1%
#2 Claude Opus 4.8 87.7%
#3 Claude Opus 4.7 83.6%

2 models

Black-box RNA Sequence Design Open

Dyno Therapeutics black-box RNA sequence modeling and design evaluation reported in Anthropic's Claude Opus 4.8 system card.

Biology | Anthropic System Card | Design score (top) | Added May 28, 2026

#1 Claude Mythos Preview 11.2
#2 Claude Opus 4.8 10.1

14 models

Blueprint-Bench 2 Open

Spatial-reasoning benchmark measuring how accurately models convert apartment photos into 2D floor plans.

Multimodal | Andon Labs Svelte Node | Normalized Score | Added May 28, 2026

#1 Human* 0.809
#2 GPT-5.5 0.706 +/- 0.008
#3 Gemini 3.5 Flash 0.694 +/- 0.006

8 models

BrowserART Open

Browser Agent Red teaming Toolkit (BrowserART) benchmark that evaluates whether browser agents pursue harmful web tasks despite refusal training in their underlying chat models.

Safety | BrowserART arXiv LaTeX Table | Human Rewrite ASR | Added May 28, 2026 | Published Oct 11, 2024

#1 OpenHands + Opus-3 40%
#2 OpenHands + o1-preview 63%
#3 OpenHands + Gemini-1.5 65%

12 models

CanItEdit Open

Instructional code-editing benchmark that measures how accurately models apply natural-language edit instructions to existing code.

Coding | ICLR 2025 PDF Table A2 | CanItEdit Accuracy | Added May 28, 2026

#1 NextCoder-32B 62.4%
#2 QwenCoder-2.5-32B 61%
#3 NextCoder-14B 60.2%

22 models

CG-Bench Open

Clue-grounded long-video question-answering benchmark evaluating MCQ accuracy, clue-grounding credibility metrics, and open-ended answer accuracy.

Video | CG-Bench Official Static Leaderboard | Open-Ended Accuracy | Added May 28, 2026

#1 GPT-4o-08-06 39.2% open-ended acc. / 44.9% MCQ long acc.
#2 Claude3.5-Sonnet 35.6% open-ended acc. / 40.3% MCQ long acc.
#3 InternVL2.5 34.2% open-ended acc. / 44.2% MCQ long acc.

12 models

ChaosBench Open

Subseasonal-to-seasonal climate prediction benchmark with deterministic and probabilistic model leaderboards across T-850, Z-500, and Q-700 variables.

Climate | ChaosBench Official Plotly Iframes | T-850 RMSE at day 44 | Added May 28, 2026

#1 climatology vs. ERA5 (control) 3.3882 T-850 RMSE at day 44
#2 ecmwf vs. ERA5 (ensemble) 3.4618 T-850 RMSE at day 44
#3 ncep vs. ERA5 (ensemble) 3.7616 T-850 RMSE at day 44

2 models

ChartMuseum Open

Chart-understanding benchmark reported in Anthropic system cards, with no-tools and Python-tools variants.

Multimodal | Anthropic System Card | With tools score | Added May 28, 2026

#1 Claude Opus 4.8 89.7%
#2 Claude Opus 4.7 85.9%

2 models

ChartQAPro Open

Harder chart-understanding evaluation for professional and technical visual-question-answering tasks.

Multimodal | Anthropic System Card | With tools score | Added May 28, 2026

#1 Claude Opus 4.8 72.3%
#2 Claude Opus 4.7 69.8%

13 models

Codeforces Open

Competitive-programming benchmark based on Codeforces problems, reported here as the LLM Stats CodeForces Score.

Coding | LLM Stats Codeforces Embedded JSON | LLM Stats CodeForces Score | Added May 28, 2026

#1 DeepSeek-V4-Flash-Max 1
#1 DeepSeek-V4-Pro-Max 1
#3 DeepSeek-V3.2-Speciale 0.9

33 models

ConStory-Bench Open

Long story generation benchmark measuring cross-scene consistency bugs using Consistency Error Density over 2,000 generated stories.

Long Context | ConStory-Bench Official README | Consistency Error Density | Added May 28, 2026

#1 GPT-5-Reasoning CED 0.113
#2 Gemini-2.5-Pro CED 0.302
#3 Gemini-2.5-Flash CED 0.305

Showing 16 of 1,002 benchmarks
