BenchmarkList

Recently added benchmarks


May 28, 2026

146 benchmarks
2 models

Dyno Therapeutics evaluation of AAV capsid packaging prediction, reported in Anthropic's Claude Opus 4.8 system card.

Biology | Anthropic System Card | No corpus, no PLM AUROC | Added May 28, 2026
  1. #1 Claude Mythos Preview 0.8
  2. #2 Claude Opus 4.8 0.8
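
For readers unfamiliar with the AUROC metric reported above, here is a minimal sketch of how it can be computed from binary labels and model scores. It uses the pairwise (Mann-Whitney) formulation: the fraction of positive/negative pairs in which the positive example scores higher, with ties counted as half. The function name `auroc` and the sample data are hypothetical; the evaluation's actual scoring code is not given in the listing.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical data: 1 = capsid packages, 0 = does not (illustrative only).
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.2, 0.7, 0.4, 0.5, 0.1]
print(round(auroc(labels, scores), 3))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, so higher is better for this metric.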
6 models

AgentQuest: Evaluates autonomous agent performance on multi-step tasks requiring planning, state tracking, tool use, and recovery.

Agentic | AgentQuest arXiv LaTeX Table | Success Rate | Added May 28, 2026 | Published Apr 9, 2024
  1. #1 Modified LangChain chat agent + GPT-4 on ALFWorld 93%
  2. #2 LangChain chat agent + GPT-4 on ALFWorld 86%
  3. #3 Modified LangChain chat agent + GPT-4 on Mastermind 60%
9 models

Audio hallucination benchmark for large audio-language models across semantic, acoustic, and confusion hallucination types.

Audio | Aha Bench OpenReview Table 2 | Hallucination Accuracy | Added May 28, 2026
  1. #1 Gemini-2.5-Pro 60%
  2. #2 GPT-Audio 28.75%
  3. #3 Kimi-Audio 23.94%
3 models

Final-answer research-math slice of MathArena's ArxivMath benchmark, drawn from the March and April 2026 releases, as reported in Anthropic's Claude Opus 4.8 system card.

Mathematics | Anthropic System Card | Score | Added May 28, 2026
  1. #1 Claude Opus 4.8 71.8%
  2. #2 GPT-5.5 71.5%
  3. #3 Gemini 3.1 Pro Preview 64.8%
30 models

Interspeech 2026 challenge benchmark for audio reasoning systems, with public Single Model and Agent track leaderboards scored by rubric and accuracy.

Audio | Audio Reasoning Challenge Official Leaderboard | Rubrics | Added May 28, 2026
  1. #1 TalTech (Agent Track) 69.83 rubrics / 76.9% accuracy
  2. #2 AISpeech (Agent Track) 66.23 rubrics / 77.4% accuracy
  3. #3 AI^2 (Agent Track) 66.09 rubrics / 75.1% accuracy
4 models

Verified BioPipelineBench slice for bioinformatics pipeline tasks, reported in Anthropic's Claude Opus 4.8 system card.

Biology | Anthropic System Card | Accuracy | Added May 28, 2026
  1. #1 Claude Mythos Preview 88.1%
  2. #2 Claude Opus 4.8 87.7%
  3. #3 Claude Opus 4.7 83.6%
2 models

Dyno Therapeutics evaluation of black-box RNA sequence modeling and design, reported in Anthropic's Claude Opus 4.8 system card.

Biology | Anthropic System Card | Design score (top) | Added May 28, 2026
  1. #1 Claude Mythos Preview 11.2
  2. #2 Claude Opus 4.8 10.1
14 models

Spatial-reasoning benchmark measuring how accurately models convert apartment photos into 2D floor plans.

Multimodal | Andon Labs Svelte Node | Normalized Score | Added May 28, 2026
  1. #1 Human* 0.809
  2. #2 GPT-5.5 0.706 +/- 0.008
  3. #3 Gemini 3.5 Flash 0.694 +/- 0.006
8 models

Browser Agent Red-teaming Toolkit (BrowserART) benchmark for evaluating whether browser agents pursue harmful web tasks despite refusal training in their underlying chat models.

Safety | BrowserART arXiv LaTeX Table | Human Rewrite ASR | Added May 28, 2026 | Published Oct 11, 2024
  1. #1 OpenHands + Opus-3 40%
  2. #2 OpenHands + o1-preview 63%
  3. #3 OpenHands + Gemini-1.5 65%
12 models

CanItEdit: Measures how accurately models carry out instructed edits to existing code.

Coding | ICLR 2025 PDF Table A2 | CanItEdit Accuracy | Added May 28, 2026
  1. #1 NextCoder-32B 62.4%
  2. #2 QwenCoder-2.5-32B 61%
  3. #3 NextCoder-14B 60.2%
22 models

Clue-grounded long-video question-answering benchmark evaluating MCQ accuracy, clue-grounding credibility metrics, and open-ended answer accuracy.

Video | CG-Bench Official Static Leaderboard | Open-Ended Accuracy | Added May 28, 2026
  1. #1 GPT-4o-08-06 39.2% open-ended acc. / 44.9% MCQ long acc.
  2. #2 Claude3.5-Sonnet 35.6% open-ended acc. / 40.3% MCQ long acc.
  3. #3 InternVL2.5 34.2% open-ended acc. / 44.2% MCQ long acc.
12 models

Subseasonal-to-seasonal climate prediction benchmark with deterministic and probabilistic model leaderboards across T-850, Z-500, and Q-700 variables.

Climate | ChaosBench Official Plotly Iframes | T-850 RMSE at day 44 | Added May 28, 2026
  1. #1 climatology vs. ERA5 (control) 3.3882 T-850 RMSE at day 44
  2. #2 ecmwf vs. ERA5 (ensemble) 3.4618 T-850 RMSE at day 44
  3. #3 ncep vs. ERA5 (ensemble) 3.7616 T-850 RMSE at day 44
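
The entries above are ranked by root-mean-square error against ERA5, where lower is better. As a minimal sketch of the metric, the following computes a plain, unweighted RMSE between a forecast and a reference field. The function name `rmse` and the temperature values are hypothetical, and the benchmark's own implementation may differ (for example, by weighting grid cells by area).

```python
import math


def rmse(forecast, reference):
    """Unweighted root-mean-square error between two equal-length sequences."""
    if len(forecast) != len(reference) or not forecast:
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = [(f - r) ** 2 for f, r in zip(forecast, reference)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical 850 hPa temperatures in kelvin at four grid points.
forecast = [271.0, 268.5, 280.0, 275.5]
reference = [270.0, 270.5, 277.0, 275.5]
print(round(rmse(forecast, reference), 4))
```

Because RMSE is in the same units as the variable, a T-850 RMSE can be read directly as a typical temperature error in kelvin.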
2 models

Chart-understanding benchmark reported in Anthropic system cards, with no-tools and Python-tools variants.

Multimodal | Anthropic System Card | With tools score | Added May 28, 2026
  1. #1 Claude Opus 4.8 89.7%
  2. #2 Claude Opus 4.7 85.9%
2 models

Harder chart-understanding evaluation for professional and technical visual-question-answering tasks.

Multimodal | Anthropic System Card | With tools score | Added May 28, 2026
  1. #1 Claude Opus 4.8 72.3%
  2. #2 Claude Opus 4.7 69.8%
13 models

Codeforces: Measures model performance on competitive-programming problems from Codeforces.

Coding | LLM Stats Codeforces Embedded JSON | LLM Stats CodeForces Score | Added May 28, 2026
  1. #1 DeepSeek-V4-Flash-Max 1
  2. #1 DeepSeek-V4-Pro-Max 1
  3. #3 DeepSeek-V3.2-Speciale 0.9
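
The two entries tied at #1 followed by #3 are consistent with standard competition ranking, in which tied entries share a rank and the next rank is skipped. A minimal sketch, assuming that convention (the site's actual ranking logic is not stated, and `competition_ranks` is a hypothetical helper):

```python
def competition_ranks(scores, higher_is_better=True):
    """Assign ranks so that ties share a rank and the following rank is skipped."""
    ordered = sorted(scores, reverse=higher_is_better)
    # list.index returns the first position of a value, so tied scores share a rank.
    return [ordered.index(s) + 1 for s in scores]


# Scores from the three entries listed above.
entries = ["DeepSeek-V4-Flash-Max", "DeepSeek-V4-Pro-Max", "DeepSeek-V3.2-Speciale"]
scores = [1.0, 1.0, 0.9]
for name, rank in zip(entries, competition_ranks(scores)):
    print(f"#{rank} {name}")
```

For metrics where lower is better, such as an error rate, passing `higher_is_better=False` reverses the ordering.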
33 models

Long-story generation benchmark measuring cross-scene consistency bugs using Consistency Error Density (CED) over 2,000 generated stories.

Long Context | Constory Bench Official README | Consistency Error Density | Added May 28, 2026
  1. #1 GPT-5-Reasoning CED 0.113
  2. #2 Gemini-2.5-Pro CED 0.302
  3. #3 Gemini-2.5-Flash CED 0.305

Showing 16 of 1,002 benchmarks