Benchmark landscape

Benchmark Market Map

AI benchmarks grouped by the ability they test and the rough task shape they use.

Benchmarks: 1,002
Known tasks: 311,714

Ability

Coding Agents

Software engineering tasks where systems inspect repositories, edit code, run tools, and satisfy verifiers.

141 benchmarks, 6/6 lanes

Issue Resolution

22

Bug fixes, PR-style tasks, SWE-bench-style repair.

SWE-bench Lite (Software Engineering): Curated 300-instance SWE-bench subset for lower-cost evaluation of issue-resolving agents. 84 rows; metric: Resolved; top: ExpeRepair-v1.0 + Claude 4 Sonnet, 60.33%.

SWE-rebench (Coding): Continuously evolving, decontaminated software engineering benchmark built from real GitHub pull requests for evaluating coding agents. 34 rows; metric: Resolved Rate; top: Claude Opus 4.6, 65.30.

SWE-bench Full (Software Engineering): Original SWE-bench leaderboard over 2,294 real GitHub issue resolution tasks. 24 rows; metric: Resolved; top: Sonar Foundation Agent + Claude 4.5 Opus, 52.62%.

DeepSWE (Coding): Long-horizon software engineering benchmark measuring frontier coding agents on original tasks from active open-source repositories with isolated environments and program-based verifiers. 16 rows, 113 tasks; metric: Pass@1; top: GPT-5.5, 70.05.

Repo Context

42

Repository understanding, codebase QA, retrieval, and localization.

BigCodeBench (Coding): BigCodeBench evaluates code generation on practical and instruction-rich programming tasks, reporting pass@1 in complete and instruct settings. 126 rows; metric: Instruct pass@1; top: GPT-4o (2024-05-13), 51.10.

Defects4J (Software Engineering): Evaluates software-engineering agents on realistic issue resolution, repository navigation, testing, or maintenance workflows. 35 rows; metric: Defects4J Plausible @1; top: o4-mini-2025-04-16-high, 0.538.

RepairBench (Software Engineering): Evaluates software-engineering agents on realistic issue resolution, repository navigation, testing, or maintenance workflows. 35 rows; metric: Total Plausible @1; top: o4-mini-2025-04-16-high, 0.503.

ENAMEL (Coding): Efficiency-aware code-generation benchmark built from HumanEval problems with expert efficient reference solutions and strong test generators, reporting eff@1 alongside pass@1. 32 rows; metric: eff@1; top: HumanEval+, 0.52.

Project Building

8

Full apps, larger feature builds, and end-to-end implementation.

ProgramBench (Software Engineering): A benchmark where software engineering agents rebuild complete programs from compiled binaries and documentation, then are scored against hidden behavioral tests. 9 rows; metric: Resolved; top: Claude Opus 4.7, 0%.

Vibe Code Bench v1.1 (Coding): Can models build web applications from scratch? 47 rows; metric: Score; top: Claude Opus 4.8, 82.725%.

VibeCodingBench (Coding): Production-oriented coding benchmark evaluating AI coding agents across functional correctness, visual fidelity, code quality, security, cost, and speed on representative developer tasks. 15 rows; metric: Avg Score; top: Claude Opus 4.5, 89.15.

ProjDevBench (Coding): End-to-end project development benchmark evaluating coding agents on complete executable software repository construction from high-level specifications. 10 rows; metric: Final; top: Codex + GPT-5, 77.85.

Terminal / DevOps

6

Command-line, shell, container, and operational coding tasks.

Kernel / Perf

11

GPU kernels, optimization, and low-level performance work.

KernelBench Hard (Coding): KernelBench Hard evaluates autonomous coding agents on GPU kernel engineering tasks, measuring correctness and speed relative to hardware baselines. 12 rows; metric: Pass Rate; top: GPT-5.5, 100.

Optimum LLM Perf Leaderboard (Inference): Hugging Face Optimum Benchmark performance leaderboard for LLM inference configurations across PyTorch CUDA, CPU, OpenVINO, ONNX Runtime, quantization schemes, and hardware profiles. 500 rows; metric: Decode Throughput; top: trl-internal-testing/tiny-random-LlamaForCausalLM on NVIDIA A10G (pytorch, unquantized), 383.26.

ALE-Bench (Coding): Score-based algorithmic programming benchmark built from AtCoder Heuristic Contest tasks, evaluating AI systems on hard optimization problems with hidden/private test evaluation. 90 rows, 40 tasks; metric: Performance (Self-Refine x1); top: GPT-5.5, 1942.97.

LMArena WebDev Arena (Coding): LMArena's WebDev Arena leaderboard for model performance on interactive web development tasks judged by human preference. 25 rows; metric: Arena rating; top: claude-opus-4-7-thinking, 1567.85.

Code Generation

52

Programming contests, synthesis, completion, and standalone code tasks.

Open Japanese LLM Leaderboard (Language): LLM-jp open Japanese LLM leaderboard evaluating Japanese and multilingual models across code generation, entity linking, factual association, historical events, commonsense, math, translation, NLI, QA, reading comprehension, and summarization. 862 rows; metric: AVG; top: deep-analysis-research/Flux-Japanese-Qwen2.5-32B-Instruct-V1.0, 74.20.

Design Arena (Multimodal): Crowdsourced arena benchmark for AI-generated design outputs, including code-generation models, website builders, and agentic app-building systems across design and web-app tasks. 215 rows; metric: Elo; top: claude-opus-4-6, 1303.

Edge LLM Leaderboard: Raspberry Pi 5 (Inference): NYU NAIRR edge LLM leaderboard measuring local LLM variants on Raspberry Pi 5 (8GB), combining MMLU accuracy with prefill/decode throughput, model size, quantization, and backend metadata. 128 rows; metric: MMLU Accuracy; top: Mistral-7B-Instruct-v0.3 (Q8_0, llama_cpp), 43.20.

LiveCodeBench (Coding): Our implementation of the LiveCodeBench benchmark. 113 rows; metric: Score; top: Gemini 3.1 Pro Preview, 88.485%.

Ability

Web + Computer Use

Agents operating browsers, GUIs, devices, apps, or enterprise software surfaces.

144 benchmarks, 5/5 lanes

Browser / Web

31

Website navigation, web workflows, and browser control.

WebArena (Agentic): BrowserGym leaderboard slice for WebArena, evaluating autonomous web agents across realistic browser tasks. 12 rows; metric: Score; top: GenericAgent-Claude-3.7-Sonnet, 44.60.

AndroidWorld (Agentic): Measures browser, desktop, mobile, or GUI agents operating in interactive environments. 43 rows; metric: Success Rate (pass@1); top: AGI-0, 97.4%.

GAIA (HAL) (Agentic): HAL's standardized, cost-aware agent leaderboard for GAIA web assistance tasks. 32 rows; metric: Accuracy; top: HAL Generalist Agent / Claude Sonnet 4.5 (September 2025), 74.55.

VitaBench (Agentic): Interactive real-world applications benchmark for LLM agents across delivery, in-store consumption, online travel, and cross-scenario tasks with 66 tools and multi-turn user interactions. 31 rows; metric: Cross-Scenarios Avg@4; top: DeepSeek V4 Pro, 51.9%.

Desktop / OS

21

Desktop environments, operating systems, and GUI apps.

OSWorld (Agentic): Benchmark for multimodal computer-use agents performing open-ended tasks in real desktop operating-system environments. 104 rows; metric: Success rate; top: Pointer Agent w/ Opus 4.7 (100 steps), 83.64%.

HREF (Instruction Following): HREF evaluates instruction-following models with human response-guided automatic evaluation across 11 task categories. 34 rows; metric: Average; top: Llama 3.1 70B Instruct, 48.98.

WMT24++ (Translation): WMT24++ is a comprehensive multilingual machine translation benchmark that expands the WMT24 dataset to cover 55 languages and dialects. It includes human-written references and post-edits across four domains (literary, news, social, and speech) to evaluate machine translation systems and large language models across diverse linguistic contexts. 25 rows; metric: Score; top: Nemotron 3 Super, 0.87.

OSWorld-Verified (Agentic): OSWorld-Verified is a verified subset of OSWorld, a scalable real computer environment for multimodal agents supporting task setup, execution-based evaluation, and interactive learning across Ubuntu, Windows, and macOS. 23 rows; metric: Score; top: Claude Mythos Preview, 0.80.

Enterprise Apps

4

CRM, office, airline, retail, telecom, and workplace apps.

Automation

10

Tool-driven app automation and workflow execution.

scBench (Biology): Bioinformatics agent benchmark with verifiable single-cell RNA-seq workflow tasks and deterministic graders. 20 rows, 195 tasks; metric: Accuracy; top: Claude Mythos Preview, 58.2%.

AppWorld (Agentic): Benchmark for interactive app-based task completion across simulated digital services, evaluating agents on tool use and stateful workflows. 15 rows; metric: Successful Sessions; top: SmolAgents Code / openai/aws/claude-opus-4-5, 0.70.

AutomationBench (Agentic): Zapier benchmark for evaluating AI agents on end-to-end business workflow execution across sales, marketing, operations, support, finance, and HR environments. 14 rows; metric: Task Success Rate; top: Claude Opus 4.8, 15.5%.

ToolSandbox (Agentic): Evaluates tool calling, API use, function selection, structured arguments, and multi-step tool workflows. 13 rows; metric: Avg Score; top: GPT-4o, 73.

Visual UI

78

Screen grounding, app UI understanding, and visual web tasks.

Artificial Analysis Intelligence Index (Intelligence): Artificial Analysis composite benchmark aggregating challenging evaluations across mathematics, science, coding, agentic work, long-context reasoning, instruction following, and factual reliability. 506 rows; metric: Intelligence Index; top: GPT-5.5, 60.24.

ARC-AGI-2 (Agentic): Second ARC-AGI benchmark variant with a harder grid-reasoning task distribution and semi-private leaderboard evaluation. 151 rows; metric: Score; top: GPT-5.5, 85%.

ARC-AGI-1 (Agentic): ARC Prize benchmark for few-shot abstract reasoning over grid transformations, using the first ARC-AGI task distribution and semi-private leaderboard evaluation. 148 rows; metric: Score; top: Gemini 3.1 Pro Preview, 98%.

BenchLM (General Knowledge): BenchLM is a public aggregate LLM leaderboard that reports overall and category scores for frontier and open-weight models across agentic, coding, reasoning, multimodal-grounded, knowledge, multilingual, instruction-following, and math capabilities. 115 rows; metric: Overall Score; top: Claude Mythos Preview, 99.

Ability

Long Context + Memory

Benchmarks stressing retrieval, context windows, evidence use, temporal state, and durable agent memory.

86 benchmarks, 4/5 lanes

Document Retrieval

59

RAG and document-grounded answering.

Open LLM Leaderboard v2 (General Knowledge): Open LLM Leaderboard v2 aggregates model evaluations across IFEval, BBH, MATH Level 5, GPQA, MuSR, and MMLU-PRO for open-weight language models. 4,576 rows; metric: Average; top: MaziyarPanahi/calme-3.2-instruct-78b, 52.08.

Open Portuguese LLM Leaderboard (Language): Portuguese LLM leaderboard evaluating models on ASSIN2 RTE, ASSIN2 STS, FaQuAD NLI, and HateBR offensive-language tasks. 1,117 rows; metric: Average score; top: nisten/franqwenstein-35b, 88.46.

Open Arabic LLM Leaderboard (Language): Open Arabic LLM Leaderboard v2 evaluating Arabic and Arabic-interested language models across AlGhafa, ArabicMMLU, EXAMS, MadinahQA, AraTrust, ALRAGE, and ArbMMLU-HT. 165 rows, 7 tasks; metric: Average; top: Applied-Innovation-Center/Karnak, 79.29.

C-Eval (Intelligence): Evaluates broad language-model knowledge, reasoning, commonsense, instruction following, or exam-style accuracy. 143 rows; metric: Average; top: 海信星海, 92.3%.

Trajectory Memory

6

Remembering long agent sessions, traces, and user histories.

LongMemEval-V2 (Agentic): Long-term web-agent memory benchmark evaluating whether memory systems retrieve useful multimodal trajectory evidence for downstream question answering. 6 rows, 451 tasks; metric: Average Accuracy; top: AgentRunbook-C, 72.50.

Poor Paul's Benchmark (Inference): Open community benchmark for local LLM inference on consumer, prosumer, and small-business hardware, measuring throughput, latency, power, long-context behavior, tool-call accuracy, answer quality, and memory-oriented tasks. 1,128 rows, 1,128 tasks; metric: Throughput; top: gemma-4-E2B-it IQ4_NL on Apple M4 Pro (llama-server, 32 users), 349.73.

Hindsight LLM Memory Leaderboard (Agentic): LLM leaderboard for Hindsight agent-memory operations, measuring retain(), reflect(), and quality performance over memory extraction and recall workloads. 25 rows; metric: Quality Accuracy; top: GPT-5 Mini, 89.70.

Health Memory Arena (Agentic): Event-driven longitudinal health-agent benchmark over synthetic patient trajectories, evaluating lookup, trend, comparison, anomaly, and explanation capabilities. 17 rows; metric: Total Score; top: Mirobody (smart-general), 62.10.

Long-Context QA

15

Needles, long documents, books, and massive-context question answering.

ConStory-Bench (Long Context): Long story generation benchmark measuring cross-scene consistency bugs using Consistency Error Density over 2,000 generated stories. 33 rows, 2,000 tasks; metric: Consistency Error Density; top: GPT-5, CED 0.113.

Fiction.LiveBench (Long Context): Fiction comprehension and reasoning benchmark for assessing model understanding over narrative text. 22 rows; metric: Score; top: o3, 100.

MRCR v2 (8-needle) (Long Context): MRCR v2 (8-needle) is a variant of the Multi-Round Coreference Resolution benchmark that includes 8 needle items to retrieve from long contexts. This tests models' ability to simultaneously track and reason about multiple pieces of information across extended conversations. 9 rows; metric: Score; top: Claude Opus 4.6, 0.93.

Graphwalks BFS >128k (Long Context): A graph reasoning benchmark that evaluates language models' ability to perform breadth-first search (BFS) operations on graphs with context length over 128k tokens, testing long-context reasoning capabilities. 7 rows; metric: Score; top: Claude Mythos Preview, 0.80.

Temporal State

6

Time, chronology, state tracking, and evolving facts.

Knowledge Recall

0

Open-ended recall and grounded factual memory.

No mapped benchmarks yet.

Ability

Professional Workflows

Domain work where correctness depends on professional conventions, documents, and expert judgment.

129 benchmarks, 5/5 lanes

Finance + Business

27

Financial analysis, spreadsheets, investment, tax, and business documents.

FinanceBench (Finance): FinanceBench evaluates language models on financial analysis questions with source documents, gold answers, and human-annotated model completions. 16 rows; metric: Accuracy; top: GPT-4, 89.33.

TaxEval v2 (Finance): A Vals-created set of questions and responses to tax questions. 114 rows; metric: Score; top: Muse Spark, 77.678%.

CorpFin v2 (Finance): A private benchmark evaluating understanding of long-context credit agreements. 108 rows; metric: Score; top: Grok 4.3, 68.532%.

MortgageTax (Finance): Evaluating reading and understanding tax certificates as images. 76 rows; metric: Score; top: Claude Opus 4.7, 70.27%.

Medical + Health

28

Clinical, biomedical, health, and care workflow tasks.

HealthBench (Healthcare): Evaluates clinical, biomedical, medical-exam, coding, or healthcare-document reasoning. 5 rows; metric: Mean score; top: o3, 0.5990.

BRIDGE Medical Leaderboard (Medical): Clinical practice text understanding leaderboard for medical LLMs, covering summarization, dialogue, clinical evidence, and EHR-oriented tasks across multiple prompting settings. 321 rows, 87 tasks; metric: Average Performance; top: gemini-1.5-pro-002 (Few-Shot), 55.51.

BioASQ (Healthcare): Evaluates clinical, biomedical, medical-exam, coding, or healthcare-document reasoning. 318 rows; metric: Mean exact-answer headline score; top: DMIS_MES_TEST_1, 0.7415.

Open Medical-LLM Leaderboard (Medical): Open Life Science AI leaderboard evaluating LLMs on medical QA and medical MMLU tasks, including PubMedQA, MedQA, MedMCQA, and six medical MMLU subjects. 185 rows; metric: Medical average; top: ProbeMedicalYonseiMAILab/medllama3-v20, 90.01.

Enterprise Ops

20

Operational business workflows and internal process execution.

Tau2-Bench Telecom (Agentic): Dual-control conversational AI benchmark simulating telecom support scenarios where agent and user coordinate actions to resolve service issues. 406 rows; metric: Success Rate; top: GLM 4.7 Flash, 98.8%.

Spider 2.0 (Data): Enterprise text-to-SQL workflow benchmark over BigQuery, Snowflake, DBT, and realistic database engineering tasks. 113 rows, 632 tasks; metric: Score; top: Databao Agent, 60.29.

GSMA Open Telco Leaderboard (Domain): Telecommunications-domain model leaderboard across TeleQnA, TeleTables, ORANBench, srsRANBench, TeleMath, TeleLogs, and 3GPP pattern-matching benchmarks. 85 rows; metric: Average; top: OTel-LLM-8.3B-QnA, 85.98.

Workspace-Bench (Agentic): Workspace-agent benchmark over file-heavy tasks involving documents, spreadsheets, presentations, code, and multi-file dependencies. 45 rows; metric: Rubric Pass Rate; top: OpenClaw + Opus-4.7, 66.7%.

Research + Science

41

Scientific work, literature, experiments, and research assistance.

SciCode (Coding): Scientist-curated coding benchmark with subproblems drawn from laboratory problems across scientific disciplines. 483 rows; metric: Accuracy; top: Gemini 3.1 Pro Preview, 58.9%.

CritPt (Science): Research-level physics reasoning benchmark with composite challenges designed by active physics researchers. 398 rows; metric: Accuracy; top: DeepSeek V4 Pro, 12.9%.

PinchBench (Agentic): Real-world OpenClaw agent benchmark evaluating how LLMs perform as the model inside an agent across practical coding, scheduling, research, email, and file-management workflows. 68 rows; metric: Best Score; top: Claude Opus 4.6, 0.93.

ASTA Bench (Agentic): Allen AI benchmark for scientific discovery agents spanning literature understanding, code execution, data analysis, and end-to-end discovery tasks. 67 rows; metric: Overall Score; top: Asta Scholar QA (w/ Tables), 91.31.

Ability

Multimodal Understanding

Visual, audio, video, spatial, and physical-world perception and reasoning.

102 benchmarks, 5/5 lanes

OCR + Docs

27

Document AI, OCR, forms, PDFs, and screenshots.

Medical STT Benchmark (Audio): Speech-to-text benchmark for long-form medical dialogue, ranking cloud and local transcription systems with Medical Word Error Rate on the PriMock57 dataset. 42 rows; metric: Medical WER; top: google-gemini-3-pro-preview, 0.01.

IDP Leaderboard (Multimodal): Document AI leaderboard combining OCR, table extraction, key information extraction, and visual question answering scores from OlmOCR, OmniDocBench, and IDP Core evaluations. 29 rows; metric: Overall Score; top: Nanonets OCR-3, 85.87.

olmOCR-bench (Multimodal): Document OCR benchmark from AllenAI for measuring OCR model quality on varied real-world document pages. 25 rows; metric: Score; top: datalab-to/chandra-ocr-2, 85.90.

Arena AI Document (Document AI): Crowdsourced Arena AI pairwise human-preference leaderboard for PDF and document-understanding models. 23 rows; metric: Arena ELO; top: Claude Opus 4.6, 1526.

Charts + Tables

12

Charts, plots, tables, and structured visual data.

Habitat Challenge (Embodied AI): Habitat embodied-navigation and rearrangement challenge leaderboards across PointNav, ObjectNav, Rearrange-Easy, and OVMM tracks. 59 rows; metric: Track Primary Score; top: Arnold (2020 ObjectNav), 0.1.

Video-MME (Multimodal): Video-MME evaluates multimodal video understanding across short, medium, and long videos, with and without subtitle context. 51 rows; metric: Overall w subs; top: video-SALMONN 2+, 81.60.

CALVIN (Embodied): Language-conditioned robot manipulation benchmark for long-horizon sequences and multitask learning in tabletop environments. 46 rows; metric: LH-MTLC average length; top: FLOWER (Train A, B, C, D -> Test D), 4.67.

Math-VR (Multimodal): Mathematical visual reasoning benchmark for VLMs, unified models, and LLMs, reporting answer correctness and process scores on text and multimodal questions. 31 rows, 2,500 tasks; metric: Overall Answer Correctness; top: Qwen3 VL 235B A22B Thinking, 66.8.

Image + Spatial

29

Images, visual QA, localization, and spatial reasoning.

timm ImageNet Robustness (Vision): The timm leaderboard ranks image classification models across ImageNet and robustness variants including ImageNet-ReaL, ImageNetV2, ImageNet-Sketch, and ImageNet-R. 1,556 rows; metric: Average Top-1; top: eva02_large_patch14_448.mim_m38m_ft_in22k_in1k, 84.96.

ScienceQA (Multimodal): ScienceQA evaluates multimodal science question answering across natural, social, and language science topics with text and image context splits. 84 rows; metric: Average; top: Mutimodal-T-SciQ_Large, 96.18.

MMMU-Pro (Multimodal): MMMU-Pro evaluates expert-level multimodal understanding with vision and standard variants derived from the MMMU benchmark. 76 rows; metric: MMMU-Pro Overall; top: GPT-5.5, 83.2%.

Visual-Language Understanding (Multimodal): Scale’s SEAL Leaderboard evaluates top models’ visual-language understanding, testing perception, logic, calculation, and common sense. 63 rows; metric: Score; top: Gemini 2.5 Pro Experimental (March 2025), 54.65.

Video + Audio

30

Video, speech, sound, and audiovisual understanding.

Arena AI Image-to-Video (Multimodal): Crowdsourced Arena AI pairwise human-preference leaderboard for image-to-video generation models. 39 rows; metric: Arena ELO; top: dreamina-seedance-2.0-720p, 1454.

Arena AI Text-to-Video (Multimodal): Crowdsourced Arena AI pairwise human-preference leaderboard for text-to-video generation models. 39 rows; metric: Arena ELO; top: dreamina-seedance-2.0-720p, 1460.

HEAR (Audio): Holistic Evaluation of Audio Representations benchmark comparing general-purpose audio embeddings across diverse audio and speech tasks. 34 rows, 19 tasks; metric: Score Sum; top: RedRice ced_base, 14.0610.

AudioMC (Speech): AudioMultiChallenge benchmarks E2E spoken dialogue systems on multi-turn interaction, voice editing, and instruction retention. 30 rows; metric: Score; top: Gemini 2.5 Pro, 46.90.

Robotics / Physical

4

Embodied, robotics, manipulation, and physical-world tasks.

Ability

Safety, Security + Trust

Adversarial behavior, misuse, cyber, privacy, robustness, hallucination, and trustworthiness.

135 benchmarks, 5/5 lanes

Jailbreaks / Misuse

60

Harmful requests, jailbreaks, policy violations, and refusal behavior.

OpenUGI (Alignment): Public leaderboard tracking Uncensored General Intelligence and willingness-to-answer scores for AI models on undisclosed sensitive-topic evaluations. 1,218 rows; metric: UGI Score; top: xai/grok-4.20-multi-agent-beta-0309 (agent_count=4), 70.

Open Ko-LLM Leaderboard (Language): Upstage Open Ko-LLM leaderboard evaluating Korean language model performance across translated reasoning, instruction following, safety, helpfulness, EQ, GSM8K, GPQA, Winogrande, and KorNAT tasks. 1,192 rows; metric: Average; top: nbeerbower/gemma2-gutenberg-27B, 55.93.

PandaBench (Safety): Comprehensive LLM safety benchmark for jailbreak attacks, defense mechanisms, judges, and safety-capability tradeoffs, aggregating attack success rates and AlpacaEval capability scores by model and defense method. 490 rows, 104,160 tasks; metric: Robustness Score; top: Claude-3-5-sonnet + ICL, 98.25.

RewardBench (Alignment): Reward model benchmark evaluating preference models across chat, hard chat, safety, reasoning, and prior preference-evaluation sets. 188 rows; metric: Score; top: infly/INF-ORM-Llama3.1-70B, 95.11.

Cyber Security

17

CTFs, exploits, secure coding, vulnerabilities, and CWE coverage.

Agent Security League (Coding): AI coding agent security benchmark measuring functional correctness and security correctness across 200 real-world tasks spanning 77 CWE classes. 17 rows; metric: Secure; top: Cursor + GPT-5.5, 23.50.

Claw Bench (Agentic): Claw Bench is a standardized leaderboard for evaluating AI agent frameworks across task completion, efficiency, security, skills, and UX dimensions. 100 rows; metric: Overall; top: 土拨鼠的AnyGen, 100.

FinEval (Finance): Chinese financial-domain benchmark covering financial academic knowledge, industry knowledge, security, financial agents, multimodal finance tasks, and rigor testing. 49 rows; metric: Weighted Average; top: Ant Group Finix-CI-72B (fineval 6 0), 86.07.

ExploitBench v8-bench (Cybersecurity): Capability-graded cybersecurity agent benchmark measuring how far AI systems progress on 41 patched V8 exploitation tasks, from coverage and reproduction through exploit primitives and arbitrary code execution. 28 rows; metric: Mean score; top: Claude Mythos Preview, 9.9 points.

Privacy + PII

2

PII detection, masking, leakage, and privacy preservation.

Hallucination + Truth

43

Factuality, grounding, hallucination, and truthfulness.

Open Chinese LLM Leaderboard (Language): BAAI leaderboard for Chinese-oriented LLM evaluation across C-ARC, C-HellaSwag, C-TruthfulQA, C-Winogrande, C-GSM8K, C-SEM, C-MMLU, and CLCC-H. 177 rows; metric: Average; top: Qwen/Qwen2-72B-Instruct, 75.67.

YALL Nous Leaderboard (Reasoning): Yet Another LLM Leaderboard snapshot for the Nous benchmark suite, aggregating public AGIEval, GPT4All, TruthfulQA, and Bigbench scores for open LLMs. 162 rows; metric: Average; top: mlabonne/OmniTruthyBeagle-7B-v0, 57.80.

Vectara HHEM Hallucination Leaderboard (Factuality): Leaderboard using Vectara's Hughes Hallucination Evaluation Model to measure hallucination and factual consistency in document summarization. 102 rows, 1,006 tasks; metric: Factual Consistency Rate; top: antgroup/finix_s1_32b-, 98.20.

NarrativeQA (Generalization): Measures long-context retrieval, needle finding, summarization, factual grounding, or retrieval-augmented generation quality. 66 rows; metric: F1; top: Llama 2 (70B), 76.993143%.

Governance + Fairness

13

Bias, fairness, compliance, trust, and alignment checks.

MedQA (Healthcare): Evaluating language model bias in medical questions. 95 rows; metric: Score; top: o1, 96.517%.

JSONSchemaBench (Structured Output): Structured-output benchmark measuring schema compliance and JSON validity for language models across easy, medium, and hard schema-constrained generation datasets. 45 rows; metric: Schema Compliance; top: GPT-4o, 96.9% schema compliance.

Stick To Your Role! (Alignment): Leaderboard benchmarking LLM stability in simulated populations and roleplay settings, with ordinal, cardinal, rank-order stability, and structural fit metrics. 32 rows; metric: Cardinal score; top: Qwen2.5 VL 72B Instruct, 0.84.

Altered Riddles (Reasoning): Reasoning benchmark for conditioned override, testing whether models fall back to memorized answers when familiar riddles are deliberately modified with constraints, context swaps, meaning shifts, or bias probes. 23 rows, 700 tasks; metric: Conditioned Override Rate; top: xiaomi/mimo-v2-pro, high reasoning, 0.2873.

Ability

Tools, Data + Structured Work

Using APIs, databases, tools, spreadsheets, schemas, and structured outputs.

77 benchmarks, 5/5 lanes

SQL + Data

9

SQL, databases, analytics, tables, and data agent tasks.

BIRD-SQL (Data): BIRD-SQL evaluates database-grounded text-to-SQL systems on execution accuracy over large cross-domain databases. 102 rows; metric: Test Execution Accuracy; top: Human Performance, 92.96.

TabArena All Tasks (Tabular): TabArena ranks tabular machine learning systems across all datasets and tasks; this snapshot uses the primary no-imputation, all-repeats, all-tasks leaderboard view. 59 rows; metric: Elo; top: AutoGluon 1.5 (extreme, 4h), 1700.

DataBench (Data): Real-world tabular question-answering benchmark over many datasets, used in SemEval 2025 Task 8. 37 rows; metric: DataBench accuracy; top: TeleAI, 95.02%.

Spider (Data): Spider evaluates complex cross-domain semantic parsing and text-to-SQL generalization over unseen database schemas. 34 rows; metric: Execution Accuracy with Values; top: MiniSeek, 91.20.

Function Calling

59

API use, tool calling, and structured tool selection.

BFCL-V4 (Tool Use): Berkeley Function Calling Leaderboard V4 (BFCL-V4) evaluates LLMs on their ability to accurately call functions and APIs, including simple, multiple, parallel, and nested function calls across diverse programming scenarios. 14 rows; metric: Score; top: Claude Opus 4.6, 76.7%.

localmaxxing (Inference): Community leaderboard for local LLM inference speed across model, hardware, engine, quantization, context length, and batch-size configurations. 543 rows; metric: Output Throughput; top: Qwen3.5-0.8B-Base on NVIDIA H200 NVL (vllm BF16), 2665.14.

GQA (Intelligence): Evaluates multimodal understanding across image, text, chart, diagram, or cross-modal reasoning tasks. 540 rows; metric: Overall accuracy; top: c_q7_m8 (B. Zacharie), 87.85%.

Berkeley Function-Calling Leaderboard (Agentic): Measures AI models' ability to correctly call and use functions in various contexts. 109 rows; metric: Overall Accuracy; top: Claude Opus 4.5, 77.47%.

Docs + Sheets

1

Documents, spreadsheets, forms, and office-style structured work.

MCP / APIs

4

MCP servers, API ecosystems, and external tool environments.

Ability

Reasoning + Knowledge

Core model capability tests across math, science, knowledge, language, logic, and instruction following.

188 benchmarks, 5/5 lanes

Math

30

Math competitions, arithmetic, proofs, and quantitative reasoning.

MATH Level 5 (Math): Measures mathematical reasoning, symbolic problem solving, proof construction, or competition-style problem solving. 4,576 rows; metric: Accuracy; top: nvidia/AceMath-72B-Instruct, 71.45.

Humanity's Last Exam (Intelligence): Frontier-level benchmark with expert-vetted closed-ended questions across mathematics, sciences, and humanities. 501 rows; metric: Accuracy; top: Claude Mythos Preview, 64.7%.

AIME 2025 (Math): All 30 problems from the 2025 American Invitational Mathematics Examination, testing olympiad-level mathematical reasoning. 269 rows; metric: Accuracy; top: GPT-5.2, 99%.

MathVision (Intelligence): MathVision evaluates multimodal mathematical reasoning on a full 3,040-example visual math test set. 160 rows; metric: All; top: GPT-5.4, 96.10.

Science

53

Science QA, physics, biology, chemistry, and technical exams.

Open Catalyst OC20 (Materials): OC20 leaderboard for catalyst and adsorbate relaxation/energy prediction, spanning S2EF, IS2RS, and/or IS2RE task splits. 510 rows; metric: Task Primary Score; top: EquiformerV2 - 31M - LBFGS Fix (IS2RE, OOD Both), 14.59 (IS2RE OOD Both primary score).

GPQA Diamond (Reasoning): The hardest GPQA subset of graduate-level science questions in biology, chemistry, and physics. 503 rows; metric: Accuracy; top: Claude Mythos Preview, 94.6%.

JARVIS-Leaderboard AI (Materials): NIST JARVIS AI model contribution index across materials ML tasks including force fields, property prediction, spectra, atom generation, and materials text tasks. 139 rows, 133 tasks; metric: Reported Score; top: mlearn_analysis_Ge_orb-v2 on mlearnall_Ge_stresses, 0.668917 MULTIMAE.

TextClass Benchmark (Classification): TextClass Benchmark evaluates LLMs and transformers for social-science text classification across multiple domains and languages, reporting domain-specific Elo leaderboards and a weighted Meta-Elo aggregate. 112 rows; metric: Meta-Elo; top: GPT-4o, 1825.22.

Exams + Knowledge

65

General exams, knowledge QA, and broad capability tests.

MuSR (Intelligence): Evaluates broad language-model knowledge, reasoning, commonsense, instruction following, or exam-style accuracy. 4,576 rows; metric: Accuracy; top: JungZoona/T3Q-Qwen2.5-14B-Instruct-1M-e3, 38.69.

MMLU-Pro (Intelligence): Enhanced MMLU benchmark with graduate-level questions across 14 subject areas and ten answer options. 351 rows; metric: Accuracy; top: Claude Opus 4.6, 89.7%.

Artificial Analysis Openness Index (Openness): Composite Artificial Analysis measure of model openness across weights availability, licensing, data transparency, and methodology transparency. 233 rows; metric: Openness Index; top: Apertus 70B Instruct, 88.89.

AI Energy Score (Efficiency): AI Energy Score compares model energy efficiency across language, vision, audio, and generation tasks using GPU energy per 1,000 queries and a 1-5 energy score. 204 rows; metric: Energy score; top: mrm8488/bert-tiny-finetuned-squadv2 (Question Answering), 5.

Logic + Planning

24

Puzzles, planning, symbolic reasoning, and hard reasoning.

AlpacaEval (Generalization): Automatic instruction-following evaluator comparing model responses against a reference using GPT-4 judgments and length-controlled win rates. 102 rows; metric: Length-Controlled Win Rate; top: xwinlm-70b-v0.1, 95.56803995.

EQ-Bench (Generalization): Emotional-intelligence benchmark for language models using scenario questions that test social and emotional understanding. 86 rows; metric: EQ-Bench score; top: GPT-4 Turbo Preview, 86.05.

ZebraLogic (Reasoning): ZebraLogic evaluates models on grid-style zebra logic puzzles, reporting exact puzzle accuracy and cell-level accuracy across difficulty and puzzle sizes. 66 rows; metric: Puzzle Acc; top: o3 Mini High, 91.70.

K-MetBench (Weather): Expert meteorology benchmark over 1,774 Korean National Meteorological Engineer Examination questions, including reasoning, geo-cultural, text-only, and multimodal subsets. 64 rows, 1,774 tasks; metric: Accuracy; top: Gemini 3, 93.7% accuracy.

Language + Multilingual

16

Language understanding, translation, multilingual, and writing tasks.

NeoEvalPlusN (Creative): Public leaderboard for proprietary command-following, distractor-resistance, expectation-breaking, poem, and stylized-writing tests run mainly on open-source LLM variants. 202 rows; metric: Total Score; top: TheDrummer/Behemoth-X-123B-v2, 21.

VNTL Leaderboard (Translation): Leaderboard for Japanese visual-novel translation into English, ranking LLMs and translation systems by semantic similarity accuracy over 256 translation samples, with chrF reported as an auxiliary metric. 87 rows, 256 tasks; metric: Accuracy; top: anthropic/claude-3-opus, 74.59.

IFEval (Instruction Following): IFEval evaluates instruction following with verifiable prompt-level and instruction-level constraints, reporting strict and loose accuracy scores. 33 rows; metric: Final Score; top: GLM 5.1, 94.5%.

RP-Bench (Creative): Roleplay quality benchmark evaluating LLMs on character consistency, user agency, lorebook integration, temporal reasoning, genre craft, and community preference. 30 rows, 58 tasks; metric: Community ELO; top: Claude Opus 4.6, 1705.70.