Dyno Therapeutics AAV capsid packaging prediction evaluation reported in Anthropic's Claude Opus 4.8 system card.
- #1 Claude Mythos Preview 0.8
- #2 Claude Opus 4.8 0.8
Benchmark List
AgentQuest: Evaluates autonomous agent performance on multi-step tasks requiring planning, state tracking, tool use, and recovery.
Interspeech 2026 challenge benchmark for audio reasoning systems, with public Single Model and Agent track leaderboards scored by rubric and accuracy.
Verified BioPipelineBench slice for bioinformatics pipeline tasks, reported in Anthropic's Claude Opus 4.8 system card.
Dyno Therapeutics black-box RNA sequence modeling and design evaluation reported in Anthropic's Claude Opus 4.8 system card.
Spatial-reasoning benchmark measuring how accurately models convert apartment photos into 2D floor plans.
Browser Agent Red teaming Toolkit benchmark for evaluating whether browser agents pursue harmful web tasks despite refusal training in their underlying chat models.
Clue-grounded long-video question-answering benchmark evaluating MCQ accuracy, clue-grounding credibility metrics, and open-ended answer accuracy.
Subseasonal-to-seasonal climate prediction benchmark with deterministic and probabilistic model leaderboards across T-850, Z-500, and Q-700 variables.
Chart-understanding benchmark reported in Anthropic system cards, with no-tools and Python-tools variants.
Harder chart-understanding evaluation for professional and technical visual-question-answering tasks.
Codeforces: Measures model capability on programming, code generation, code repair, or repository-level software tasks.
Long story generation benchmark measuring cross-scene consistency bugs using Consistency Error Density over 2,000 generated stories.
A private benchmark evaluating understanding of long-context credit agreements.
Qwen internal cowork productivity benchmark spanning long-horizon tasks in computer science, finance, law, medical, and other productivity domains.
Cursor's coding-agent benchmark for ambiguous, multi-file tasks sourced from real Cursor sessions, with CursorBench 3.1 adding codebase understanding, bugfinding, planning, and code review problems.
CyberBench: Measures model robustness, truthfulness, calibration, bias, harmfulness, jailbreak resistance, or alignment-relevant behavior.
Data-agent benchmark spanning the full data intelligence lifecycle, with data-engineering architecture, implementation, evolution, and data-analysis tasks in English and Chinese.
SecureBio DNA synthesis screening evasion evaluation reported in Anthropic's Claude Opus 4.8 system card.
Interactive Factorio automation benchmark for LLM agents, tracking production score, milestones, automation milestones, and lab-task success rate.
Evaluates agents on core financial analyst tasks using the FAB v2 harness.
Financial data-search benchmark evaluating model+search-tool agents on time-sensitive, simple historical, and complex historical data retrieval for global and Greater China markets.
Anthropic internal Firefox 147 SpiderMonkey exploitation evaluation reported in the Claude Opus 4.8 system card.
Interactive 3D vision-reasoning benchmark where models plan physical actions in puzzle and stacking environments.
Software-engineering agent benchmark targeting frontier-level implementation, performance optimization, and research tasks.
Leaderboard for large language models on geospatial code generation using AutoGEEval and AutoGEEval++ pass@k metrics on Google Earth Engine tasks.
Global MMLU multilingual knowledge-and-reasoning evaluation reported in Anthropic's Claude Opus 4.8 system card.
Graduate-level Google-Proof Q&A benchmark evaluating models on questions that require deep reasoning.
Multitask multilingual hallucination detection benchmark spanning QA and dialogue summarization in English, Arabic, Hindi, and Turkish.
Open-source long-horizon legal agent benchmark where agents work from client-matter instructions, closed-universe matter files, and expert rubrics to produce reviewable legal work product.
HealthBench professional subset for medically challenging, expert-oriented healthcare question answering.
HELM AIR-Bench: Measures model robustness, truthfulness, calibration, bias, harmfulness, jailbreak resistance, or alignment-relevant behavior.
HELM Instruct: Evaluates broad language-model knowledge, reasoning, commonsense, instruction following, or exam-style accuracy.
HELM MedQA: Evaluates clinical, biomedical, medical-exam, coding, or healthcare-document reasoning.
HELM Safety: Measures model robustness, truthfulness, calibration, bias, harmfulness, jailbreak resistance, or alignment-relevant behavior.
Humanity's Last Exam results reported by Qwen with tool use enabled.
Head-to-head LLM game-arena ladder for chess, using InfiniteBM's per-game Bradley-Terry Elo ratings across model and human matches.
Head-to-head LLM game-arena ladder for Coup, using InfiniteBM's per-game Bradley-Terry Elo ratings across bluffing and imperfect-information matches.
Head-to-head LLM game-arena ladder for heads-up no-limit hold'em, using InfiniteBM's per-game Bradley-Terry Elo ratings across poker decision matches.
Head-to-head LLM game-arena ladder for Liar's Dice, using InfiniteBM's per-game Bradley-Terry Elo ratings across hidden-information bidding and challenge timing matches.
Head-to-head LLM game-arena ladder for Settlers of Catan, using InfiniteBM's per-game Bradley-Terry Elo ratings across negotiation and planning matches.
Head-to-head LLM game-arena ladder for Werewolf, using InfiniteBM's per-game Bradley-Terry Elo ratings across social-deduction matches.
Artificial Analysis implementation of IBM's ITBench SRE benchmark, testing AI agents on Kubernetes incident root-cause analysis from offline incident snapshots.
NIST JARVIS AI model contribution index across materials ML tasks including force fields, property prediction, spectra, atom generation, and materials text tasks.
Structured-output benchmark measuring schema compliance and JSON validity for language models across easy, medium, and hard schema-constrained generation datasets.
Kernel optimization benchmark reporting median speedup over a PyTorch eager reference and the fraction of problems faster than torch.compile.
LABBench2 clinical-trials subset reported in Anthropic's Claude Opus 4.8 system card.
LABBench2 patent-question subset reported in Anthropic's Claude Opus 4.8 system card.
LABBench2 table-reading subset reported in Anthropic's Claude Opus 4.8 system card.
LABBench2 supplementary-materials subset reported in Anthropic's Claude Opus 4.8 system card.
Evaluates language models on a wide range of open-source legal reasoning tasks.
Legal reasoning benchmark derived from 340 law exams across 116 law-school courses, covering long-form open questions and multiple-choice questions in English and German.
Our implementation of the LiveCodeBench benchmark.
Live benchmark for research-level theorem comprehension, with monthly multiple-choice questions derived from newly published arXiv mathematics papers.
Automated biological-risk evaluation for long-form virology task completion, reported in Anthropic's Claude Opus 4.8 system card.
Automated biological-risk evaluation for long-form virology task completion, reported in Anthropic's Claude Opus 4.8 system card.
Finance benchmark measuring lookahead bias in LLM trading workflows by comparing in-sample and out-of-sample alpha decay.
Magic: The Gathering benchmark leaderboard for LLMs, reporting Season 1 model ratings, blunder index, games played, win rate, and average API cost across the combined format leaderboard.
Materials-discovery benchmark evaluating machine-learning energy models for crystal stability prediction, geometry optimization, and related high-throughput discovery tasks.
Multi-image spatial intelligence VQA benchmark with 1,000 human-designed questions across real-world 3D scene understanding, robotics, driving, and motion reasoning.
Evaluates reading and understanding of tax certificates presented as images.
MRCR-v2 long-context retrieval subset using a 128K context window with 8 needles, as reported in Qwen's Qwen3.7-Max launch post.
Professional office question-answering evaluation reported in Anthropic's Claude Opus 4.8 system card.
Harder professional office question-answering evaluation reported in Anthropic's Claude Opus 4.8 system card.
OC20 leaderboard for catalyst and adsorbate relaxation/energy prediction, spanning S2EF, IS2RS, and/or IS2RE task splits.
OC22 leaderboard for catalyst and adsorbate relaxation/energy prediction, spanning S2EF, IS2RS, and/or IS2RE task splits.
Anthropic internal organic-chemistry evaluation reported in the Claude Opus 4.8 system card.
Synthetic insider-threat detection benchmark built from OrgForge organizational simulation telemetry with triage, verdict, and false-positive scoring.
Anthropic system-card ProgramBench evaluation on the 166-task golden subset, reported as hidden behavioral test pass rate across 1-5 episodes.
Automated theorem-proving benchmark.
Protein fitness prediction benchmark comparing zero-shot models on DMS substitution assays with aggregate Spearman correlations and drill-downs by function, MSA depth, taxa, and mutation depth.
Hard ProteinGym subset reported in Anthropic's Claude Opus 4.8 system card.
Anthropic internal biological-protocol troubleshooting evaluation reported in the Claude Opus 4.8 system card.
Numerical forecasting benchmark evaluating whether LLMs produce calibrated 90% prediction intervals for 1,000 real-world questions under zero-shot, grounded, and agentic retrieval settings.
Real-user-distribution Claw agent benchmark referenced in Qwen's Qwen3.7-Max launch post.
Qwen internal front-end code generation benchmark covering bilingual web artifact tasks with auto-rendering and multimodal judging.
Qwen internal benchmark evaluating LLMs as world models for simulating agentic environments across Terminal, SWE, MCP, Search, OS, Android, and Web domains.
Open evaluation framework from Callstack measuring AI model performance on real-world React Native development tasks across navigation, animation, async state, lists, and React Native APIs.
Cybersecurity benchmark with 30K multiple-choice and 240 open-ended QA items covering knowledge, offensive skills, and tool expertise.
Vendor-reported 928-question finance-agent benchmark spanning vertical-specific skills, metrics, financial-statement analysis, and forecasting workflows.
Security engineering benchmark evaluating agents on vulnerability discovery, proof-of-concept generation, and vulnerability patching targets.
Security benchmark for AI-generated and AI-repaired code, reporting secure-code repair and generation scores with and without hints.
Structured-output benchmark evaluating text and visual structured generation and conversion across 18 formats and 2,035 examples.
Open-ended structural-biology evaluation reported in Anthropic's Claude Opus 4.8 system card.
Evaluates models on solving production software-engineering tasks.
A Vals-created set of tax questions and responses.
Therapeutics Data Commons ADMET benchmark leaderboard for TDC.BBB_Martins, reporting AUROC on submitted molecular models.
Therapeutics Data Commons drug-target interaction domain-generalization benchmark on BindingDB Patent.
Therapeutics Data Commons ADMET benchmark leaderboard for TDC.Bioavailability_Ma, reporting AUROC on submitted molecular models.
Therapeutics Data Commons ADMET benchmark leaderboard for TDC.Caco2_Wang, reporting MAE on submitted molecular models.
Therapeutics Data Commons ADMET benchmark leaderboard for TDC.Clearance_Hepatocyte_AZ, reporting Spearman on submitted molecular models.
Therapeutics Data Commons ADMET benchmark leaderboard for TDC.Clearance_Microsome_AZ, reporting Spearman on submitted molecular models.
Therapeutics Data Commons ADMET benchmark leaderboard for TDC.CYP2C9_Substrate_CarbonMangels, reporting AUPRC on submitted molecular models.
Therapeutics Data Commons ADMET benchmark leaderboard for TDC.CYP2C9_Veith, reporting AUPRC on submitted molecular models.
Therapeutics Data Commons ADMET benchmark leaderboard for TDC.CYP2D6_Substrate_CarbonMangels, reporting AUPRC on submitted molecular models.
Therapeutics Data Commons ADMET benchmark leaderboard for TDC.CYP2D6_Veith, reporting AUPRC on submitted molecular models.
Therapeutics Data Commons ADMET benchmark leaderboard for TDC.CYP3A4_Substrate_CarbonMangels, reporting AUROC on submitted molecular models.
Therapeutics Data Commons ADMET benchmark leaderboard for TDC.CYP3A4_Veith, reporting AUPRC on submitted molecular models.
Therapeutics Data Commons docking benchmark for DRD3 molecule generation with oracle-call-budget leaderboard tables.
Therapeutics Data Commons ADMET benchmark leaderboard for TDC.Half_Life_Obach, reporting Spearman on submitted molecular models.
Therapeutics Data Commons ADMET benchmark leaderboard for TDC.HIA_Hou, reporting AUROC on submitted molecular models.
Therapeutics Data Commons ADMET benchmark leaderboard for TDC.LD50_Zhu, reporting MAE on submitted molecular models.
Therapeutics Data Commons ADMET benchmark leaderboard for TDC.Lipophilicity_AstraZeneca, reporting MAE on submitted molecular models.
Therapeutics Data Commons ADMET benchmark leaderboard for TDC.Pgp_Broccatelli, reporting AUROC on submitted molecular models.
Therapeutics Data Commons ADMET benchmark leaderboard for TDC.PPBR_AZ, reporting MAE on submitted molecular models.
Therapeutics Data Commons ADMET benchmark leaderboard for TDC.Solubility_AqSolDB, reporting MAE on submitted molecular models.
Therapeutics Data Commons TCR-epitope binding interaction benchmark with NA and RN sampling-method leaderboards.
Therapeutics Data Commons ADMET benchmark leaderboard for TDC.VDss_Lombardo, reporting Spearman on submitted molecular models.
State-of-the-art set of difficult terminal-based tasks.
State-of-the-art set of difficult terminal-based tasks.
Simulator-based bilateral price negotiation benchmark for LLM agents, measuring surplus extraction, feasible agreement calibration, belief error, and procedural compliance without an LLM judge.
Thai language and Thai cultural-context safety benchmark reporting attack success rate across harmful-content categories.
2026 USA Mathematical Olympiad proof-based evaluation reported in Anthropic's Claude Opus 4.8 system card using MathArena-style model-judge grading.
Benchmark reporting weighted performance across finance and coding tasks, showing the potential impact that LLMs can have on the economy.
Benchmark reporting weighted performance across finance, coding, and education tasks, showing the potential impact that LLMs can have on the economy.
Multi-modal structured extraction benchmark over 1,777 government forms with per-document JSON schemas and image/text/spatial modalities.
SecureBio Virology Capabilities Test multimodal virology evaluation reported in Anthropic's Claude Opus 4.8 system card.
Long-horizon autonomous agent benchmark measuring how well models operate a simulated vending-machine business over an extended period.
Can models build web applications from scratch?
Vision-language model throughput benchmark reporting peak tokens/sec, TTFT, TPOT, and worker count on an RTX PRO 6000 Blackwell vLLM setup.
Reproducible browser-agent benchmark of 532 tedious web tasks extending WebArena with massive-memory, calculation, and long-term-memory chores.
Agent API-use benchmark measuring robustness to real-world API complexity across multi-turn conversations, API functions, and injected complexity scenarios.
Interactive game-agent leaderboard where LLMs play games in LLM-generated contexts with LLM-enforced rules and LLM-scored outcomes.
Agentic backend coding benchmark evaluating whether coding agents can explore real repositories, edit code, configure environments, deploy containerized services, and pass external HTTP integration tests.
ActivityNet-QA: Evaluates temporal, video, speech, or audio understanding beyond static text and image inputs.
AgentBoard: Evaluates autonomous agent performance on multi-step tasks requiring planning, state tracking, tool use, and recovery.
Clinical simulation benchmark for sequential diagnostic decision-making with multimodal patient interactions and compliance checks.
AgentDojo: Measures model robustness, truthfulness, calibration, bias, harmfulness, jailbreak resistance, or alignment-relevant behavior.
Spreadsheet-assistant benchmark with realistic spreadsheet prompts, dynamic-output checks, and latency measurements.
Aider polyglot coding-agent leaderboard over 225 Exercism tasks across C++, Go, Java, JavaScript, Python, and Rust.
Audio-text conflict benchmark measuring whether audio-language models follow the audio signal instead of conflicting text.
Automatic instruction-following evaluator comparing model responses against a reference using GPT-4 judgments and length-controlled win rates.
Reasoning benchmark for conditioned override, testing whether models fall back to memorized answers when familiar riddles are deliberately modified with constraints, context swaps, meaning shifts, or bias probes.
Android development benchmark measuring how well LLM agents resolve real Android tasks, with success rate, latency, token use, and cost.
AndroidWorld: Measures browser, desktop, mobile, or GUI agents operating in interactive environments.
ARC Challenge: Evaluates broad language-model knowledge, reasoning, commonsense, instruction following, or exam-style accuracy.
Arena-Hard: Evaluates conversational quality, human preference, helpfulness, and pairwise response judgments.
Multi-turn voice-agent benchmark with appointment, assistant, conversation, event, grocery, and product workflows scored across multiple dimensions.
Audio-language benchmark covering speech, sound, music, ASR, QA, translation, and audio reasoning tasks across many task-level metrics.
Medical AutoResearch benchmark for single-LLM coding agents across segmentation, image enhancement, VQA, report generation, and lesion detection, with workflow and task-quality scoring.
Robotics challenge leaderboard for BEHAVIOR-1K household activity policies, ranked by Q-score and full-task success.
Measures AI models' ability to correctly call and use functions in various contexts.
Business-driven financial benchmark covering anomalous event attribution, numerical computation, time reasoning, financial QA, event relation, stock prediction, and entity recognition.
Cybersecurity benchmark measuring AI agent detection, exploitation, and patching on real-world bug bounty tasks, including success rates, bounty value, and token costs.
Clinical practice text understanding leaderboard for medical LLMs, covering summarization, dialogue, clinical evidence, and EHR-oriented tasks across multiple prompting settings.
Composite CAIS AI Dashboard risk index averaging VCT refusal risk, HLE miscalibration, MASK risk, Machiavelli, and TextQuests Harm for models with all component scores. Lower is better.
Composite CAIS AI Dashboard text index averaging Humanity's Last Exam, ARC-AGI-2, TextQuests, and SWE-bench Pro for models with all component scores.
Composite CAIS AI Dashboard vision index averaging EnigmaEval, IntPhys2, ERQA, MindCube, ART, and SpatialViz for models with all component scores.
Autonomous-driving agent leaderboard in CARLA measuring driving score, route completion, and infraction penalties.
Quarterly refreshed enterprise-workflow benchmark grounded in live ClawHub marketplace signals and scored with deterministic checks plus structured judging.
Climate model emulation benchmark over NorESM2 simulation outputs, reporting NRMSE for climate variables under future scenarios.
Clotho-AQA: Evaluates temporal, video, speech, or audio understanding beyond static text and image inputs.
CodeContests: Measures model capability on programming, code generation, code repair, or repository-level software tasks.
Code editing benchmark covering debugging, code translation, requirement switching, and code polishing across primary and plus splits.
CommonsenseQA: Evaluates broad language-model knowledge, reasoning, commonsense, instruction following, or exam-style accuracy.
ContractNLI: Measures legal reasoning, contract review, statute interpretation, or legal-domain QA.
HAL's cost-aware agent leaderboard for CORE-Bench Hard scientific programming tasks.
CrowS-Pairs: Measures model robustness, truthfulness, calibration, bias, harmfulness, jailbreak resistance, or alignment-relevant behavior.
CyberSecEval: Measures model robustness, truthfulness, calibration, bias, harmfulness, jailbreak resistance, or alignment-relevant behavior.
Multilingual variant of DeepResearch Bench for evaluating deep research agents across translated prompt sets and languages.
End-to-end document understanding benchmark suite with a public evaluator and task-specific document AI metrics.
Dynamic speech-language-model benchmark and leaderboard for speech instruction following across many audio tasks.
Education benchmark spanning student and teacher scenarios, educational tasks, and multidimensional educational response quality.
Education-specific safety benchmark for teaching harm, adversarial safety, pedagogical fidelity, refusal, and safe tutoring behavior.
FinanceQA leaderboard for industry-grade financial analysis tasks across basic tactical, assumption-based tactical, and conceptual categories.
Chinese financial-domain benchmark covering financial academic knowledge, industry knowledge, security, financial agents, multimodal finance tasks, and rigor testing.
Financial tool-use benchmark with real tools and APIs, measuring tool invocation, execution success, compliance, and soft-scored task quality.
FrontierMath: Measures mathematical reasoning, symbolic problem solving, proof construction, or competition-style problem solving.
HAL's standardized, cost-aware agent leaderboard for GAIA web assistance tasks.
Gaokao-Bench: Evaluates broad language-model knowledge, reasoning, commonsense, instruction following, or exam-style accuracy.
Repository-level code-agent benchmark covering real GitHub tasks, reporting task pass rate, execution completion rate, token usage, and cost.
AI Habitat embodied-navigation and rearrangement challenge leaderboards across PointNav, ObjectNav, Rearrange-Easy, and OVMM tracks.
Healthcare administration agent benchmark for prior authorization, appeals, durable medical equipment, payer portals, fax, and EHR-adjacent workflows.
HealthBench: Evaluates clinical, biomedical, medical-exam, coding, or healthcare-document reasoning.
HealthBench Hard: Evaluates clinical, biomedical, medical-exam, coding, or healthcare-document reasoning.
HELM GSM8K: Measures mathematical reasoning, symbolic problem solving, proof construction, or competition-style problem solving.
HELM LegalBench: Measures legal reasoning, contract review, statute interpretation, or legal-domain QA.
HELM Long Context: Measures long-context retrieval, needle finding, summarization, factual grounding, or retrieval-augmented generation quality.
HumanEval+ code-generation leaderboard from EvalPlus.
Live provider-level inference latency, throughput, cost, and reliability tracker for hosted language-model APIs.
InfiniteBench: Measures long-context retrieval, needle finding, summarization, factual grounding, or retrieval-augmented generation quality.
InfographicVQA: Measures visual question answering, OCR, document understanding, chart comprehension, or layout-aware reasoning.
Financial decision-making benchmark for investment agents, evaluating portfolio or trading decisions rather than only financial QA.
K-12 education benchmark for subject knowledge, problem solving, and educational-goal cognition across school-level tasks.
LeanDojo Benchmark: Measures mathematical reasoning, symbolic problem solving, proof construction, or competition-style problem solving.
Live medical benchmark with time-stamped real-world cases and after-cutoff scoring for measuring medical model robustness over time.
Dynamic live safety benchmark for large language models across ethics, legality, privacy, factuality, and psychological health.
LLaVA-Bench: Evaluates multimodal understanding across image, text, chart, diagram, or cross-modal reasoning tasks.
LLaVA-Bench in the Wild: Evaluates multimodal understanding across image, text, chart, diagram, or cross-modal reasoning tasks.
LongBench v2: Measures long-context retrieval, needle finding, summarization, factual grounding, or retrieval-augmented generation quality.
Long-document understanding benchmark for locating and reasoning over evidence across very large document collections.
Open-ended math tutoring benchmark for measuring pedagogical capabilities such as mistake diagnosis, Socratic questioning, and scaffolding.
Interactive EHR-agent benchmark with physician-written tasks over healthcare data and FHIR-style clinical workflows.
Medical browsing and search benchmark for multi-hop clinical research questions over live or web-grounded medical sources.
Clinical LLM benchmark leaderboard spanning closed-ended medical QA, open-ended clinical tasks, medical safety, summarization, note generation, HealthBench, EHRSQL, MedCalc, MedEC, general-domain, and DischargeMe evaluations.
Medical diagnostic safety benchmark measuring harm-weighted safety pass rate, coverage, diagnostic recall, and over-escalation behavior.
MIMIC-IV-Ext-Bench: Evaluates clinical, biomedical, medical-exam, coding, or healthcare-document reasoning.
MLAgentBench: Evaluates autonomous agent performance on multi-step tasks requiring planning, state tracking, tool use, and recovery.
MLCommons client-device inference benchmark for PC, Mac, and local/client form-factor AI performance.
MLCommons benchmark for API-hosted GenAI endpoints, measuring serving latency, throughput, concurrency, and endpoint efficiency.
Audited MLCommons inference benchmark results for datacenter, edge, and GenAI workloads, including throughput, latency, and power-efficiency views.
MMBench-CN: Evaluates multimodal understanding across image, text, chart, diagram, or cross-modal reasoning tasks.
Fine-grained multimodal document understanding benchmark with OCR-free VQA, grounding, and document reasoning tasks.
MMLU Medical Genetics: Evaluates clinical, biomedical, medical-exam, coding, or healthcare-document reasoning.
MMLU Professional Medicine: Evaluates clinical, biomedical, medical-exam, coding, or healthcare-document reasoning.
MobileWorld: Measures browser, desktop, mobile, or GUI agents operating in interactive environments.
MoralChoice: Measures model robustness, truthfulness, calibration, bias, harmfulness, jailbreak resistance, or alignment-relevant behavior.
MultiHop-RAG: Measures long-context retrieval, needle finding, summarization, factual grounding, or retrieval-augmented generation quality.
MultiMedQA: Evaluates clinical, biomedical, medical-exam, coding, or healthcare-document reasoning.
NarrativeQA: Measures long-context retrieval, needle finding, summarization, factual grounding, or retrieval-augmented generation quality.
Natural Questions: Evaluates broad language-model knowledge, reasoning, commonsense, instruction following, or exam-style accuracy.
Needle In A Haystack: Measures long-context retrieval, needle finding, summarization, factual grounding, or retrieval-augmented generation quality.
NeedleBench: Measures long-context retrieval, needle finding, summarization, factual grounding, or retrieval-augmented generation quality.
NVIDIA benchmark for evaluating LLM-generated CUDA and CUDA Python code on correctness and optional GPU performance across kernels, runtime APIs, memory management, parallel algorithms, and GPU libraries.
Open Agent Security Benchmark results for AI-agent skills security scanners, measuring precision, recall, F1, false-positive rate, flag rate, and scan latency.
Office workflow agent benchmark spanning Word, Excel, PDF, email, calendar, and multi-application task completion.
Holistic multimodal Earth-science benchmark across atmosphere, oceans, cryosphere, biosphere, land, and human-earth interaction tasks.
HAL's standardized, cost-aware agent leaderboard for Online Mind2Web web navigation tasks.
Open financial-language-model leaderboard from FINOS, covering broad financial NLP and reasoning task categories.
OpenAI Evals: Evaluates broad language-model knowledge, reasoning, commonsense, instruction following, or exam-style accuracy.
Global high-resolution land-cover mapping benchmark and challenge family for Earth-observation segmentation.
OS-Copilot: Measures browser, desktop, mobile, or GUI agents operating in interactive environments.
Benchmark of complex multimodal desktop GUI-navigation tasks for advanced agents, with automated validation and task levels spanning basic precision clicking to multi-application workflows.
Long-horizon physician workflow benchmark grounded in clinical records, measuring checkpoint and end-to-end task success.
Medical-domain hallucination benchmark with labeled model answers to pharmaceutical questions grounded in EMA product information.
ProofNet: Measures mathematical reasoning, symbolic problem solving, proof construction, or competition-style problem solving.
Large benchmark for table detection, structure recognition, and functional analysis in scientific documents.
Multilingual document table-extraction benchmark with provider leaderboard and T-LAG scoring.
PutnamBench: Measures mathematical reasoning, symbolic problem solving, proof construction, or competition-style problem solving.
RealToxicityPrompts: Measures model robustness, truthfulness, calibration, bias, harmfulness, jailbreak resistance, or alignment-relevant behavior.
RealWorldQA: Evaluates multimodal understanding across image, text, chart, diagram, or cross-modal reasoning tasks.
RepairBench: Evaluates software-engineering agents on realistic issue resolution, repository navigation, testing, or maintenance workflows.
Benchmark for AI coding and research agents that asks systems to conduct scientific research from raw data to publication-quality reports, with tasks grounded in human-authored target studies.
Autonomous AI-research-agent benchmark spanning the full ML research cycle across five realistic research tasks.
Chest radiology interpretation leaderboard covering radiology report generation and medical VQA on ReXGradient, MIMIC-CXR, IU-Xray, CheXpert Plus, and ReXVQA.
HAL's cost-aware agent leaderboard for SciCode scientific programming tasks.
HAL's standardized, cost-aware agent leaderboard for ScienceAgentBench scientific agent tasks.
Scientific knowledge evaluation benchmark spanning biology, chemistry, materials science, and physics tasks.
ScreenSpot: Measures browser, desktop, mobile, or GUI agents operating in interactive environments.
Spreadsheet control benchmark for agents operating spreadsheet software through actions rather than only answering table questions.
Spatial transcriptomics agent benchmark with verifiable spatial biology analysis tasks and deterministic graders.
Enterprise text-to-SQL workflow benchmark over BigQuery, Snowflake, DBT, and realistic database engineering tasks.
Spreadsheet-agent benchmark for real Excel tasks and business spreadsheet workflows, including financial modeling, debugging, and visualization.
StrategyQA: Evaluates broad language-model knowledge, reasoning, commonsense, instruction following, or exam-style accuracy.
Sustainability benchmark suite across agriculture, poverty, land cover, water, climate-action, and related geospatial ML tasks.
SWE-bench Docker: Evaluates software-engineering agents on realistic issue resolution, repository navigation, testing, or maintenance workflows.
SWE-bench Extra: Evaluates software-engineering agents on realistic issue resolution, repository navigation, testing, or maintenance workflows.
SWE-bench JavaScript: Evaluates software-engineering agents on realistic issue resolution, repository navigation, testing, or maintenance workflows.
Microsoft live SWE-bench-style benchmark for real-world issue resolution, updated with recent GitHub tasks and frozen lite/verified splits for evaluation.
HAL's cost-aware agent leaderboard for the SWE-bench Verified Mini software engineering subset.
Polyglot software-engineering benchmark with Java, Python, JavaScript, and TypeScript tasks plus retrieval/localization diagnostics.
Pull-request review benchmark with a public paper-baseline leaderboard for model review quality and false-positive behavior.
TableBench: Measures structured-data reasoning over tables, spreadsheets, charts, databases, or data analysis tasks.
HAL's standardized, cost-aware agent leaderboard for TAU-bench Airline customer-service tasks.
TaxBench evaluates AI models on real-world tax tasks from Rivet's active tax workflows, spanning tax knowledge and judgment, tax calculations, and agentic data-retrieval question answering.
ToolAlpaca: Evaluates tool calling, API use, function selection, structured arguments, and multi-step tool workflows.
ToolSandbox: Evaluates tool calling, API use, function selection, structured arguments, and multi-step tool workflows.
HAL's cost-aware agent leaderboard for USACO competitive programming tasks.
Spoken tool-use agent benchmark for speech-in agents performing tool selection, parameter filling, orchestration, multi-turn handling, and safety checks.
Multimodal voice-assistant benchmark covering listening, speaking, viewing, role-play, audio context, and consistency dimensions.
Multifaceted benchmark for LLM-based voice assistants across speech-input knowledge, instruction following, safety, robustness, and accents/noise.
Voyager Minecraft: Measures embodied-agent, navigation, manipulation, or simulated robotics task success.
Benchmark dataset for data-driven medium-range weather forecasting, reporting Z500 and T850 RMSE at 3-day and 5-day lead times.
Standard benchmark and evaluation framework for data-driven weather forecasting models.
WikiSQL: Measures structured-data reasoning over tables, spreadsheets, charts, databases, or data analysis tasks.
WikiTableQuestions: Measures structured-data reasoning over tables, spreadsheets, charts, databases, or data analysis tasks.
Windows Agent Arena: Measures browser, desktop, mobile, or GUI agents operating in interactive environments.
Workspace-agent benchmark over file-heavy tasks involving documents, spreadsheets, presentations, code, and multi-file dependencies.
Long-horizon software engineering benchmark measuring frontier coding agents on original tasks from active open-source repositories with isolated environments and program-based verifiers.
Long-term web-agent memory benchmark evaluating whether memory systems retrieve useful multimodal trajectory evidence for downstream question answering.
Patent prosecution AI benchmark from ABIGAIL covering USPTO Office Action parsing, docketing, response strategy, drafting, prior-art analysis, and hallucination checks.
Expert meteorology benchmark over 1,774 Korean National Meteorological Engineer Examination questions, including reasoning, geo-cultural, text-only, and multimodal subsets.
Generalist robot-policy leaderboard over RoboCasa atomic and composite kitchen manipulation tasks, including seen and unseen settings.
Zapier benchmark for evaluating AI agents on end-to-end business workflow execution across sales, marketing, operations, support, finance, and HR environments.
Roboflow Vision Evals open-vocabulary object detection benchmark where models draw bounding boxes from text queries across SaCo-Gold and COCO-100.
Roboflow Vision Evals OCR benchmark combining OCRBench text-reading prompts and a license-plate recognition dataset.
Roboflow Vision Evals open-vocabulary instance segmentation benchmark where models output pixel masks from text queries across SaCo-Gold and COCO-100.
Roboflow Vision Evals benchmark for visual QA tasks such as reading text from photos, counting objects, spotting defects, and understanding documents.
Benchmark for autonomous CLI agents optimizing OpenAI-compatible LLM inference servers under a fixed one-H100, two-hour budget, with quality and integrity gates and scenario-specific speedup metrics.
Intology benchmark for autonomous coding and ML research agents on the NanoGPT Speedrun, measuring how much historical human speedrun progress agents can recover from a strong human starting point under a fixed H100-hour compute budget.
Capability-graded cybersecurity agent benchmark measuring how far AI systems progress on 41 patched V8 exploitation tasks, from coverage and reproduction through exploit primitives and arbitrary code execution.
AIIQ composite estimate that combines abstract, mathematical, programmatic, and academic reasoning benchmark evidence into IQ-like model scores.
Artificial Analysis knowledge and hallucination benchmark measuring factual recall, abstention, and hallucination across economically relevant domains.
Artificial Analysis implementation of APEX-Agents using the Stirrup agent harness for long-horizon, cross-application professional-services tasks.
Artificial Analysis composite benchmark aggregating challenging evaluations across mathematics, science, coding, agentic work, long-context reasoning, instruction following, and factual reliability.
Composite Artificial Analysis measure of model openness across weights availability, licensing, data transparency, and methodology transparency.
Gert Labs global model ranking across game environments that evaluate agentic coding, one-shot coding, and decision-making performance.
The hardest GPQA subset of graduate-level science questions in biology, chemistry, and physics.
Frontier-level benchmark with expert-vetted closed-ended questions across mathematics, sciences, and humanities.
Dual-control conversational AI benchmark simulating telecom support scenarios where agent and user coordinate actions to resolve service issues.
Artificial Analysis Terminal-Bench hard subset for terminal-based software engineering, system administration, game-playing, and data-processing tasks.
AudioMultiChallenge Audio Output track benchmarks spoken dialogue systems that produce audio responses in multi-turn conversations.
micro1 legal reasoning benchmark set in realistic litigation, transactional, and compliance contexts, evaluating long-horizon legal work products with IRAC-decomposed rubrics.
ACEBench is a comprehensive benchmark for evaluating Large Language Models' tool usage capabilities across three primary evaluation types: Normal (basic tool usage scenarios), Special (tool usage with ambiguous or incomplete instructions), and Agent (multi-agent interactions simulating real-world dialogues). The benchmark covers 4,538 APIs across 8 major domains and 68 sub-domains including technology, finance, entertainment, society, health, culture, and environment, supporting both English and Chinese languages.
Cost-efficient AfroBench subset for African-language model evaluation, covering representative task and dataset scores with expanded recent model coverage.
AI coding agent security benchmark measuring functional correctness and security correctness across 200 real-world tasks spanning 77 CWE classes.
Benchmark for evaluating reward models and judge systems on agent trajectories from AssistantBench, VisualWebArena, WebArena, WorkArena, and WorkArena++.
Agentset's reranker leaderboard compares reranking models in a RAG pipeline across six datasets using GPT-5 pairwise judgments and Elo ratings.
A human-centric benchmark for evaluating foundation models on standardized exams including college entrance exams (Gaokao, SAT), law school admission tests (LSAT), math competitions, lawyer qualification tests, and civil service exams. Contains 20 tasks (18 multiple-choice, 2 cloze) designed to assess understanding, knowledge, reasoning, and calculation abilities in real-world academic and professional contexts.
AI Energy Score compares model energy efficiency across language, vision, audio, and generation tasks using GPU energy per 1,000 queries and a 1-5 energy score.
Trustworthiness benchmark for LLMs covering toxicity, stereotypes, adversarial robustness, out-of-distribution robustness, adversarial demonstrations, privacy, ethics, and fairness.
Aider benchmark for model performance on code refactoring tasks.
AI Research Science benchmark evaluating autonomous ML research agents across 20 tasks sourced from state-of-the-art papers in NLP, code, math, biochemical modelling, and time-series forecasting.
Android-In-The-Zoo (AitZ) benchmark for evaluating autonomous GUI agents on smartphones. Contains 18,643 screen-action pairs with chain-of-action-thought annotations spanning over 70 Android apps. Designed to connect perception (screen layouts and UI elements) with cognition (action decision-making) for natural language-triggered smartphone task completion.
AlignBench is a comprehensive multi-dimensional benchmark for evaluating Chinese alignment of Large Language Models. It contains 8 main categories: Fundamental Language Ability, Advanced Chinese Understanding, Open-ended Questions, Writing Ability, Logical Reasoning, Mathematics, Task-oriented Role Play, and Professional Knowledge. The benchmark includes 683 real-scenario rooted queries with human-verified references and uses a rule-calibrated multi-dimensional LLM-as-Judge approach with Chain-of-Thought for evaluation.
ALL Bench LLM is a composite model leaderboard that aggregates cross-verified LLM scores across reasoning, knowledge, coding, instruction-following, and agentic benchmarks.
ALL Bench Multimodal aggregates cross-verified AI model scores across LLM, VLM, agent, image generation, video generation, and music generation categories in one unified benchmark file.
American Mathematics Competition problems from the 2022-23 academic year, consisting of multiple-choice mathematics competition problems designed for high school students. These problems require advanced mathematical reasoning, problem-solving strategies, and mathematical knowledge covering topics like algebra, geometry, number theory, and combinatorics. The benchmark is derived from the official AMC competitions sponsored by the Mathematical Association of America.
Android device-control benchmark that assesses agent performance on mobile interface tasks using the high exact-match evaluation metric.
Android control benchmark that evaluates autonomous agents on mobile device interaction tasks using the low exact-match scoring criterion.
The AI Productivity Index for Agents (APEX-Agents) measures whether frontier AI agents can execute long-horizon, cross-application tasks across three jobs in professional services.
Extended APEX-v1 benchmark for expert-level professional tasks across finance, legal, medical, and consulting domains, scored by rubric grading.
A comprehensive benchmark for tool-augmented LLMs that evaluates API planning, retrieval, and calling capabilities. Contains 314 tool-use dialogues with 753 API calls across 73 API tools, designed to assess how effectively LLMs can utilize external tools and overcome obstacles in tool leveraging.
Benchmark for interactive app-based task completion across simulated digital services, evaluating agents on tool use and stateful workflows.
ARC-AGI-2 is an upgraded benchmark for measuring abstract reasoning and problem-solving abilities in AI systems through visual grid transformation tasks. It evaluates fluid intelligence via input-output grid pairs (1x1 to 30x30) using colored cells (0-9), requiring models to identify underlying transformation rules from demonstration examples and apply them to test cases. Designed to be easy for humans but challenging for AI, focusing on core cognitive abilities like spatial reasoning, pattern recognition, and compositional generalization.
Architectural Design Records Generation: Generate Architecture Decision Records (ADRs) from given decision contexts.
Dynamic Service Generation: Generate IoT services dynamically from task descriptions and runtime context.
Microservice Generation: Generate complete microservice implementations from requirements and codebase context.
Architectural Component Generation: Generate complete serverless functions (FaaS) from specifications and codebase context.
Architecture Traceability Link Recovery: Recover traceability links between software architecture documentation and source code.
Crowdsourced Arena AI pairwise human-preference leaderboard for code generation and coding-assistant models.
Crowdsourced Arena AI pairwise human-preference leaderboard for PDF and document-understanding models.
Crowdsourced Arena AI pairwise human-preference leaderboard for image-editing models.
Crowdsourced Arena AI pairwise human-preference leaderboard for image-to-video generation models.
Crowdsourced Arena AI pairwise human-preference leaderboard for text-to-image generation models.
Crowdsourced Arena AI pairwise human-preference leaderboard for text-to-video generation models.
Crowdsourced Arena AI pairwise human-preference leaderboard for video-editing models.
Arena-Hard-Auto v2 is a challenging benchmark consisting of 500 carefully curated prompts sourced from Chatbot Arena and WildChat-1M, designed to evaluate large language models on real-world user queries. The benchmark covers diverse domains including open-ended software engineering problems, mathematics, creative writing, and technical problem-solving. It uses LLM-as-a-Judge for automatic evaluation, achieving 98.6% correlation with human preference rankings while providing 3x higher separation of model performances compared to MT-Bench. The benchmark emphasizes prompt specificity, complexity, and domain knowledge to better distinguish between model capabilities.
Benchmark for automated multimodal evaluation of visual and interactive artifact generation from code, using rendered artifacts and checklist-guided MLLM judging over 1,825 diverse tasks.
ASR-FairBench evaluates automatic speech recognition models on word error rate, runtime factor, and fairness metrics across demographic and language attributes.
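As a rough illustration of the word error rate metric named above, the sketch below computes WER as the word-level edit distance divided by the reference length. This is the textbook definition, not necessarily ASR-FairBench's implementation; the function name `word_error_rate` is hypothetical, and real pipelines usually normalize casing and punctuation first, which is omitted here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat", "the dog sat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.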
BrowserGym leaderboard slice for AssistantBench web-assistance tasks requiring multi-step information seeking and tool use.
Allen AI benchmark for scientific discovery agents spanning literature understanding, code execution, data analysis, and end-to-end discovery tasks.
AttaQ is a unique dataset containing adversarial examples in the form of questions designed to provoke harmful or inappropriate responses from large language models. The benchmark evaluates safety vulnerabilities by using specialized clustering techniques that analyze both the semantic similarity of input attacks and the harmfulness of model responses, facilitating targeted improvements to model safety mechanisms.
AudioMultiChallenge Text Output track benchmarks spoken dialogue systems that produce text responses across multi-turn interactions.
AutoEval-Video evaluates large vision-language models on open-ended video question answering across nine video perception and reasoning dimensions.
AutoLogi is an automated method for synthesizing open-ended logic puzzles to evaluate reasoning abilities of Large Language Models. The benchmark addresses limitations of existing multiple-choice reasoning evaluations by featuring program-based verification and controllable difficulty levels. It includes 1,575 English and 883 Chinese puzzles, enabling more reliable evaluation that better distinguishes models' reasoning capabilities across languages.
A benchmark for early-stage visual reasoning and perception on child-like vision tasks.
Benchmark Agreement Testing leaderboard that aggregates model scores across benchmarks and analyzes benchmark agreement/correlation under a standardized BAT methodology.
BenchLM is a public aggregate LLM leaderboard that reports overall and category scores for frontier and open-weight models across agentic, coding, reasoning, multimodal-grounded, knowledge, multilingual, instruction-following, and math capabilities.
Beyond AIME is a difficult mathematical reasoning benchmark designed to test deeper reasoning chains and harder decomposition than standard AIME-style problem sets.
Berkeley Function Calling Leaderboard (BFCL) v2 is a comprehensive benchmark for evaluating large language models' function calling capabilities. It features 2,251 question-function-answer pairs with enterprise and OSS-contributed functions, addressing data contamination and bias through live, user-contributed scenarios. The benchmark evaluates AST accuracy, executable accuracy, irrelevance detection, and relevance detection across multiple programming languages (Python, Java, JavaScript) and includes complex real-world function calling scenarios with multilingual prompts.
Berkeley Function Calling Leaderboard (BFCL) V3 MultiTurn benchmark that evaluates large language models' ability to handle multi-turn and multi-step function calling scenarios. The benchmark introduces complex interactions requiring models to manage sequential function calls, handle conversational context across multiple turns, and make dynamic decisions about when and how to use available functions. BFCL V3 uses state-based evaluation by verifying the actual state of API systems after function execution, providing more realistic assessment of function calling capabilities in agentic applications.
Berkeley Function Calling Leaderboard v3 (BFCL-v3) is an advanced benchmark that evaluates large language models' function calling capabilities through multi-turn and multi-step interactions. It introduces extended conversational exchanges where models must retain contextual information across turns and execute multiple internal function calls for complex user requests. The benchmark includes 1,000 test cases across domains like vehicle control, trading bots, travel booking, and file system management, using state-based evaluation to verify both system state changes and execution path correctness.
Berkeley Function Calling Leaderboard V4 (BFCL-V4) evaluates LLMs on their ability to accurately call functions and APIs, including simple, multiple, parallel, and nested function calls across diverse programming scenarios.
BIG-Bench Extra Hard (BBEH) is a challenging benchmark that replaces each task in BIG-Bench Hard with a novel task that probes similar reasoning capabilities but exhibits significantly increased difficulty. The benchmark contains 23 tasks testing diverse reasoning skills including many-hop reasoning, causal understanding, spatial reasoning, temporal arithmetic, geometric reasoning, linguistic reasoning, logic puzzles, and humor understanding. Designed to address saturation on existing benchmarks where state-of-the-art models achieve near-perfect scores, BBEH shows substantial room for improvement with best models achieving only 9.8-44.8% average accuracy.
BigCodeBench evaluates code generation on practical and instruction-rich programming tasks, reporting pass@1 in complete and instruct settings.
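For readers unfamiliar with pass@1, the sketch below shows the commonly used unbiased pass@k estimator, of which pass@1 is the k = 1 case. It is an assumption that this matches how any particular leaderboard computes its numbers; with a single sample per task, pass@1 is simply the fraction of tasks whose sample passes all tests. The function name `pass_at_k` is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct,
    given that c of n generated samples passed the tests."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 3 of 10 samples passed: pass@1 reduces to c / n.
    print(pass_at_k(n=10, c=3, k=1))
```

A benchmark-level score is then the mean of this per-task value over all tasks.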
BLINK: Multimodal Large Language Models Can See but Not Perceive. A benchmark for multimodal language models focusing on core visual perception abilities. Reformats 14 classic computer vision tasks into 3,807 multiple-choice questions paired with single or multiple images and visual prompting. Tasks include relative depth estimation, visual correspondence, forensics detection, multi-view reasoning, counting, object localization, and spatial reasoning that humans can solve 'within a blink'.
Pairwise VLM-as-judge OCR benchmark ranking OCR models on British Library / BPL document images with Bradley-Terry Elo scores and bootstrap confidence intervals.
A challenging benchmark for evaluating web browsing agents' ability to persistently navigate the internet and find hard-to-locate, entangled information. Comprises 1,266 questions requiring strategic reasoning, creative search, and interpretation of retrieved content, with short and easily verifiable answers.
BrowseComp is a benchmark for measuring the ability of agents to browse the web, comprising 1,266 questions that require persistently navigating the internet in search of hard-to-find, entangled information. Despite the difficulty of the questions, BrowseComp is simple and easy-to-use, as predicted answers are short and easily verifiable against reference answers. The benchmark focuses on questions where answers are obscure, time-invariant, and well-supported by evidence scattered across the open web.
A high-difficulty benchmark purpose-built to comprehensively evaluate LLM agents on the Chinese web, consisting of 289 multi-hop questions spanning 11 diverse domains including Film & TV, Technology, Medicine, and History. Questions are reverse-engineered from short, objective, and easily verifiable answers, requiring sophisticated reasoning and information reconciliation beyond basic retrieval. The benchmark addresses linguistic, infrastructural, and censorship-related complexities in Chinese web environments.
Deep-research benchmark variant in Exgentic's Open Agent Leaderboard, evaluating general-purpose agents on BrowseComp+ web research tasks without domain-specific tuning.
Crypto AI Agent benchmark for LLM-based agents solving web3 and crypto tasks across answer quality, reasoning, and tool-use dimensions.
A comprehensive OCR benchmark for evaluating Large Multimodal Models (LMMs) in literacy. Comprises four OCR-centric tracks: multi-scene text reading, multilingual text reading, document parsing, and key information extraction. Contains 39 subsets with 7,058 fully annotated images, 41% sourced from real applications. Tests capabilities including text grounding, multi-orientation text recognition, and detecting hallucination/repetition across diverse visual challenges.
Charades-STA is a benchmark dataset for temporal activity localization via language queries, extending the Charades dataset with sentence temporal annotations. It contains 12,408 training and 3,720 testing segment-sentence pairs from videos with natural language descriptions and precise temporal boundaries for localizing activities based on language queries.
ChartBench evaluates complex visual reasoning over charts across chart types and task combinations.
ChartQA is a large-scale benchmark comprising 9.6K human-written questions and 23.1K questions generated from human-written chart summaries, designed to evaluate models' abilities in visual and logical reasoning over charts.
CharXiv-D is the descriptive questions subset of the CharXiv benchmark, designed to assess multimodal large language models' ability to extract basic information from scientific charts. It contains descriptive questions covering information extraction, enumeration, pattern recognition, and counting across 2,323 diverse charts from arXiv papers, all curated and verified by human experts.
CharXiv-R is the reasoning component of the CharXiv benchmark, focusing on complex reasoning questions that require synthesizing information across visual chart elements. It evaluates multimodal large language models on their ability to understand and reason about scientific charts from arXiv papers through various reasoning tasks.
NVIDIA ChatRAG Bench evaluates conversational question answering over documents or retrieved context across ten derived datasets, including long-context, table reasoning, arithmetic, and unanswerable-question scenarios.
Claw Bench is a standardized leaderboard for evaluating AI agent frameworks across task completion, efficiency, security, skills, and UX dimensions.
OpenClaw agent benchmark measuring model performance on reasoning, planning, tool use, reliability, efficiency, and safety across repeated runs.
Clembench Multimodal evaluates chat-optimized multimodal models as conversational agents through visual language games, tracking Clemscore, played percentage, quality score, and task-level metrics.
Clembench evaluates chat-optimized language models as conversational agents through language games; this v3.0 text leaderboard tracks Clemscore, played percentage, quality score, and task-level game metrics.
CLUEWSC2020 is the Chinese version of the Winograd Schema Challenge, part of the CLUE benchmark. It focuses on pronoun disambiguation and coreference resolution, requiring models to determine which noun a pronoun refers to in a sentence. The dataset contains 1,244 training samples and 304 development samples extracted from contemporary Chinese literature.
CMMLU (Chinese Massive Multitask Language Understanding) is a comprehensive Chinese benchmark that evaluates the knowledge and reasoning capabilities of large language models across 67 different subject topics. The benchmark covers natural sciences, social sciences, engineering, and humanities with multiple-choice questions ranging from basic to advanced professional levels.
Codegolf v2.2 benchmark
COLLIE is a grammar-based framework for systematic construction of constrained text generation tasks. It allows specification of rich, compositional constraints across diverse generation levels and modeling challenges including language understanding, logical reasoning, and semantic planning. The COLLIE-v1 dataset contains 2,080 instances across 13 constraint structures.
ComplexFuncBench is a benchmark designed to evaluate large language models' capabilities in handling complex function calling scenarios. It encompasses multi-step and constrained function calling tasks that require long-parameter filling, parameter value reasoning, and managing contexts up to 128k tokens. The benchmark includes 1,000 samples across five real-world scenarios.
Long-context benchmark leaderboard for multi-needle retrieval and reasoning across increasing context lengths, reported as GDM-MRCRv2 scores.
Benchmark for context retrieval in coding agents, measuring how well agents retrieve and use multi-file code context before producing fixes.
CorpusQA 1M is a long-context question answering benchmark designed to evaluate models at approximately 1 million token contexts. Models are scored on accuracy when retrieving and reasoning over information distributed across an extremely long input corpus.
CountBench evaluates object counting capabilities in visual understanding.
CoVoST 2 is a large-scale multilingual speech translation corpus derived from Common Voice, covering translations from 21 languages into English and from English into 15 languages. The dataset contains 2,880 hours of speech with 78K speakers for speech translation research.
CRAG (Comprehensive RAG Benchmark) is a factual question answering benchmark consisting of 4,409 question-answer pairs across 5 domains (finance, sports, music, movie, open domain) and 8 question categories. The benchmark includes mock APIs to simulate web and Knowledge Graph search, designed to represent the diverse and dynamic nature of real-world QA tasks with temporal dynamism ranging from years to seconds. It evaluates retrieval-augmented generation systems for trustworthy question answering.
EQ-Bench Creative Writing v3 is an LLM-judged creative writing benchmark that evaluates models across 32 writing prompts with 3 iterations per prompt. Uses a hybrid scoring system combining rubric assessment and Elo ratings through pairwise comparisons. Challenges models in areas like humor, romance, spatial awareness, and unique perspectives to assess emotional intelligence and creative writing capabilities.
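To make the idea of Elo ratings from pairwise comparisons concrete, here is a minimal sketch of the classic online Elo update. It is illustrative only: leaderboards that report Elo scores may instead fit ratings in batch (for example with a Bradley-Terry model), and the K-factor of 32, the 400-point scale, and the function name `elo_update` are assumptions rather than anything these benchmarks specify.

```python
def elo_update(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """Return updated (rating_a, rating_b) after one pairwise comparison.

    score_a is 1.0 if A wins, 0.0 if B wins, and 0.5 for a tie.
    """
    # Expected score of A under the logistic Elo model.
    expected_a = 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    # The update is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    print(elo_update(1000.0, 1000.0, score_a=1.0))
```

Beating a higher-rated opponent moves both ratings more than beating an equal or weaker one, which is what lets many pairwise judgments converge to a ranking.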
Chinese SimpleQA is the first comprehensive Chinese benchmark to evaluate the factuality ability of language models to answer short questions. It contains 3,000 high-quality questions spanning 6 major topics with 99 diverse subtopics, designed to assess Chinese factual knowledge across humanities, science, engineering, culture, and society.
CyberGym is a benchmark for evaluating AI agents on cybersecurity tasks, testing their ability to identify vulnerabilities, perform security analysis, and complete security-related challenges in a controlled environment.
Cybersecurity Capture the Flag (CTF) benchmark for evaluating LLMs in offensive security challenges. Contains diverse cybersecurity tasks including cryptography, web exploitation, binary analysis, and forensics to assess AI capabilities in cybersecurity problem-solving.
Data Agent Benchmark for Multi-step Reasoning evaluates data-analysis agents on real-world, multi-step tasks over structured and unstructured business data.
LLM-agent active-learning benchmark with budgeted search results, trap-free rate, query usage, and token-count telemetry.
DeepPlanning evaluates LLMs on complex multi-step planning tasks requiring long-horizon reasoning, goal decomposition, and strategic decision-making.
Benchmark for deep research agents that evaluates generated research reports across comprehensiveness, insight, instruction following, readability, and citation dimensions.
DeepSearchQA is a benchmark for evaluating deep search and question-answering capabilities, testing models' ability to perform multi-hop reasoning and information retrieval across complex knowledge domains.
Crowdsourced arena benchmark for AI-generated design outputs, including code-generation models, website builders, and agentic app-building systems across design and web-app tasks.
Design2Code evaluates the ability to generate code (HTML/CSS/JS) from visual designs.
Domain-specific code generation benchmark across healthcare systems, financial algorithms, molecular simulation, and legal document processing, scored for functional correctness, compliance, domain API coverage, code quality, and reference similarity.
DROP (Discrete Reasoning Over Paragraphs) is a reading comprehension benchmark requiring discrete reasoning over paragraph content. It contains crowdsourced, adversarially-created questions that require resolving references and performing discrete operations like addition, counting, or sorting, demanding comprehensive paragraph understanding beyond paraphrase-and-entity-typing shortcuts.
A multilingual closed-book question answering dataset that evaluates cross-lingual knowledge transfer in large language models across 12 languages, using knowledge-seeking questions based on Wikipedia articles that exist only in one language.
NYU NAIRR edge LLM leaderboard measuring local LLM variants on Raspberry Pi 5 (8GB), combining MMLU accuracy with prefill/decode throughput, model size, quantization, and backend metadata.
EmbSpatialBench evaluates embodied spatial understanding and reasoning capabilities.
EnigmaEval is a benchmark drawn from puzzle hunts that tests AI on complex reasoning, creative problem-solving, and cross-domain knowledge synthesis.
Leaderboard for RAG systems on EnterpriseRAG-Bench, a benchmark of company-internal knowledge retrieval and answer generation across 500 enterprise questions.
Embodied Reasoning Question Answering benchmark consisting of 400 multiple-choice visual questions across spatial reasoning, trajectory reasoning, action reasoning, state estimation, and multi-view reasoning for evaluating AI capabilities in physical world interactions.
Benchmark for evaluating model robustness against evasion-style safety and policy-circumvention prompts.
A fine-grained atomic evaluation metric for factual precision in long-form text generation that breaks generated text into atomic facts and computes the percentage supported by reliable knowledge sources, with automated assessment using retrieval and language models.
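The scoring step described above, the percentage of atomic facts supported by a knowledge source, can be sketched as follows. This is a simplified illustration: splitting text into atomic facts and verifying each one are the hard parts and are stubbed out here. The function `factual_precision` and the `is_supported` callback are hypothetical names, and the toy knowledge set stands in for a real retrieval-plus-model verifier.

```python
from typing import Callable, Iterable


def factual_precision(atomic_facts: Iterable[str],
                      is_supported: Callable[[str], bool]) -> float:
    """Return the percentage of atomic facts judged supported."""
    facts = list(atomic_facts)
    if not facts:
        raise ValueError("at least one atomic fact is required")
    supported = sum(1 for fact in facts if is_supported(fact))
    return 100.0 * supported / len(facts)


if __name__ == "__main__":
    # Toy knowledge source: a fact counts as supported only if it appears verbatim.
    knowledge = {"Paris is the capital of France.", "Water boils at 100 C at sea level."}
    facts = [
        "Paris is the capital of France.",
        "Paris has a population of 40 million.",
    ]
    print(factual_precision(facts, lambda fact: fact in knowledge))
```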
Fiction comprehension and reasoning benchmark for assessing model understanding over narrative text.
Functional metacognitive reasoning benchmark evaluating whether language models can identify uncertainty, detect inconsistencies, recover from errors, and correct their own reasoning.
FinanceBench evaluates language models on financial analysis questions with source documents, gold answers, and human-annotated model completions.
Flexible Length Question Answering dataset for evaluating the impact of input length on reasoning performance of language models, featuring True/False questions embedded in contexts of varying lengths (250-3000 tokens) across three reasoning tasks: Monotone Relations, People In Rooms, and simplified Ruletaker
Few-shot Learning Evaluation of Universal Representations of Speech: a parallel speech dataset in 102 languages built on FLoRes-101, with approximately 12 hours of speech supervision per language, for tasks including ASR, speech language identification, translation, and retrieval.
European-language subset of FLORES-200 machine translation, focusing on translation pairs involving Polish and other European languages.
Factuality, Retrieval, And reasoning MEasurement Set: a unified evaluation dataset of 824 challenging multi-hop questions for testing retrieval-augmented generation systems across factuality, retrieval accuracy, and reasoning capabilities, requiring integration of 2-15 Wikipedia articles per question.
Private FrontierMath research-level mathematics benchmark snapshot.
Private Tier 4 FrontierMath problems at research-level mathematical difficulty.
English subset of FullStackBench for evaluating end-to-end software engineering and full-stack development capability.
Chinese subset of FullStackBench for evaluating end-to-end software engineering and full-stack development capability.
A functional variant of the MATH benchmark that tests language models' ability to generalize reasoning patterns across different problem instances, revealing the reasoning gap between static and functional performance.
Agentic task leaderboard ranking LLMs across banking, healthcare, insurance, investment, and telecom workflows with accuracy, trajectory quality, cost, latency, and turn-count metrics.
GDPval-AA is an evaluation of AI model performance on economically valuable knowledge work tasks across professional domains including finance, legal, and other sectors. Run independently by Artificial Analysis, it uses Elo scoring to rank models on real-world work task performance.
GDPval-MM is the multimodal variant of the GDPval benchmark, evaluating AI model performance on real-world economically valuable tasks that require processing and generating multimodal content including documents, slides, diagrams, spreadsheets, images, and other professional deliverables across diverse industries.
GeneBench is an evaluation focused on multi-stage scientific data analysis in genetics and quantitative biology. Tasks require reasoning about ambiguous or noisy data with minimal supervisory guidance, addressing realistic obstacles such as hidden confounders or QC failures, and correctly implementing and interpreting modern statistical methods.
Benchmark for generating structured README files from entire GitHub repositories, evaluating long-context codebase summarization with BLEU, ROUGE, semantic similarity, structure, information retrieval, code consistency, and readability metrics.
Geospatial foundation model benchmark covering remote-sensing classification, segmentation, detection, and regression datasets with repeated-seed public submissions.
Global PIQA is a multilingual commonsense reasoning benchmark that evaluates physical interaction knowledge across 100 languages and cultures. It tests AI systems' understanding of physical world knowledge in diverse cultural contexts through multiple choice questions about everyday situations requiring physical commonsense.
A lightweight version of Global MMLU benchmark that evaluates language models across multiple languages while addressing cultural and linguistic biases in multilingual evaluation.
APIBench is a comprehensive dataset of over 11,000 instruction-API pairs from HuggingFace, TorchHub, and TensorHub APIs for evaluating language models' ability to generate accurate API calls.
A long document summarization dataset consisting of reports from government research agencies including Congressional Research Service and U.S. Government Accountability Office, with significantly longer documents and summaries than other datasets.
A graph reasoning benchmark that evaluates language models' ability to perform breadth-first search (BFS) operations on graphs with context length under 128k tokens, returning nodes reachable at specified depths.
A graph reasoning benchmark that evaluates language models' ability to perform breadth-first search (BFS) operations on graphs with context length over 128k tokens, testing long-context reasoning capabilities.
A graph reasoning benchmark that evaluates language models' ability to find parent nodes in graphs with context length under 128k tokens, requiring understanding of graph structure and edge relationships.
A graph reasoning benchmark that evaluates language models' ability to find parent nodes in graphs with context length over 128k tokens, testing long-context reasoning and graph structure understanding.
A subset of GroundUI-18K for UI grounding evaluation, where models must predict action coordinates on screenshots based on single-step instructions across web, desktop, and mobile platforms.
Grade School Math 8K with Chain-of-Thought prompting, featuring 8.5K high-quality linguistically diverse grade school math word problems requiring multi-step reasoning and elementary arithmetic operations.
Public leaderboard evaluating LLM factuality, faithfulness, hallucination detection, instruction following, QA, reading comprehension, and summarization tasks.
HallusionBench evaluates multimodal large language models on visual illusion and hallucination-style image-text reasoning cases.
AI agent safety benchmark measuring how often autonomous LLM agents avoid harmful tool actions under adversarial pressure.
Event-driven longitudinal health-agent benchmark over synthetic patient trajectories, evaluating lookup, trend, comparison, anomaly, and explanation capabilities.
Google DeepMind's internal mathematical reasoning benchmark that introduces novel problems not encountered during model training to evaluate true mathematical reasoning capabilities rather than memorization.
Hindi generative-task benchmark for chat and instruct models, evaluated with the 3C3H rubric across Hindi QA, grammar, and safety tasks.
LLM leaderboard for Hindsight agent-memory operations, measuring retain(), reflect(), and quality performance over memory extraction and recall workloads.
Harvard-MIT Mathematics Tournament 2025: a prestigious student-organized mathematics competition for high school students featuring two tournaments (November 2025 at Harvard and February 2026 at MIT) with individual tests, team rounds, and guts rounds.
Official Hugging Face benchmark for model performance on the February 2026 Harvard-MIT Mathematics Tournament problem set.
VEX code generation and understanding benchmark for SideFX Houdini shader/programming tasks, covering code completion, doc-to-code, and code explanation.
A multilingual variant of the HumanEval benchmark that measures functional correctness for synthesizing programs from docstrings, consisting of 164 original programming problems assessing language comprehension, algorithms, and simple mathematics
Humanity's Last Exam text-only leaderboard evaluates frontier LLMs using text-based expert questions, excluding multimodal content.
Arena leaderboard for IDEA-Bench image generation systems, reporting anonymous and full Arena Elo ratings for text-guided image generation/editing pipelines.
Document AI leaderboard combining OCR, table extraction, key information extraction, and visual question answering scores from OlmOCR, OmniDocBench, and IDP Core evaluations.
IMO-AnswerBench is a benchmark for evaluating mathematical reasoning capabilities on International Mathematical Olympiad (IMO) problems, focusing on answer generation and verification.
Impermanent is a live benchmark for temporal generalization in time-series forecasting, evaluating models sequentially as GitHub activity outcomes arrive.
European-language slice of INCLUDE-base-44, evaluating multilingual LLMs on knowledge- and reasoning-centric multiple-choice questions across 20 European languages.
PrunaAI InferBench text-to-image leaderboard comparing image-generation inference providers on quality, median latency, and price per image.
InfoVQA dataset with 30,000 questions and 5,000 infographic images requiring joint reasoning over document layout, textual content, graphical elements, and data visualizations with elementary reasoning and arithmetic skills
InfoVQA test set with infographic images requiring joint reasoning over document layout, textual content, graphical elements, and data visualizations with elementary reasoning and arithmetic skills
Pairwise VLM-as-judge OCR benchmark ranking OCR models on InkBench handwriting/document images with Bradley-Terry Elo scores.
InvisibleBench evaluates caregiver-support AI systems for relational harms, fail-closed safety and compliance gates, communication quality, coordination, and boundary integrity.
Pairwise VLM-as-judge OCR benchmark ranking OCR models on ISL FinePDFs document images with Bradley-Terry Elo scores.
Benchmark for evaluating large language models on Islamic law and jurisprudence, with public aggregate scores over 718 private instances across 13 tasks.
KernelBench Hard evaluates autonomous coding agents on GPU kernel engineering tasks, measuring correctness and speed relative to hardware baselines.
Hallucination-detection leaderboard reporting RAG hallucination rates on HaluEval-QA and non-RAG hallucination rates on UltraChat-style prompts.
Leaderboard for language models across Spanish, Catalan, Basque, and Galician benchmarks, with normalized multilingual task performance and language-specific aggregate scores.
LanguageBench evaluates AI models across many languages and tasks, including translation, classification, multilingual Q&A, advanced Q&A, and math.
Leaderboard for model performance on Latin American and Iberian language tasks, including Spanish and Portuguese understanding, TELEIA Spanish exams, FLORES/OPUS translation, and structured or image extraction.
Voice AI infrastructure latency benchmark covering LLM time-to-first-token, speech-to-text latency, text-to-speech time-to-first-byte, and pipeline combinations across providers.
Writing quality and style evaluation benchmark tracked in Epoch AI's capabilities dataset.
Language identification benchmark comparing LID models across FLORES+, MADAR, Gherbal-Multi, ATLASIA-LID, WiLI-2018, CommonLID, and Bouquet.
Linguistics reasoning benchmark evaluating models on baseline and obfuscated questions to separate reasoning ability from memorization.
Official LiveCodeBench code-generation leaderboard for contamination-aware coding evaluation over problems collected from Codeforces, LeetCode, and AtCoder.
Dynamic contamination-free text-to-SQL benchmark for real-world database tasks, including business-intelligence queries, CRUD/management SQL, hierarchical knowledge bases, and large industrial-scale database variants.
Grid-based game competition benchmark evaluating LLM strategic play and invalid-move behavior in Tic-Tac-Toe, Connect Four, and Gomoku.
Benchmark for long-term planning and reasoning over real-world knowledge graphs where models navigate Wikipedia hyperlinks from a source page to a target page.
Crowdsourced pairwise human-preference leaderboard for search-augmented AI systems in LMArena.
Crowdsourced pairwise human-preference leaderboard for text chat models in LMArena, formerly LMSYS Chatbot Arena.
Crowdsourced pairwise human-preference leaderboard for vision-language model responses in LMArena.
LMArena's WebDev Arena leaderboard for model performance on interactive web development tasks judged by human preference.
Community leaderboard for local LLM inference speed across model, hardware, engine, quantization, context length, and batch-size configurations.
Mozilla Builders local LLM hardware benchmark measuring prompt processing speed, generation speed, time to first token, and an aggregate LocalScore for model-and-accelerator configurations.
LongVideoBench is a question-answering benchmark featuring video-language interleaved inputs up to an hour long. It includes 3,763 varying-length web-collected videos with subtitles across diverse themes and 6,678 human-annotated multiple-choice questions in 17 fine-grained categories for comprehensive evaluation of long-term multimodal understanding.
LOVEU-TGVE is the CVPR 2023 LOVEU Workshop text-guided video editing competition leaderboard, ranking submitted systems by human evaluation and automated video-editing metrics.
LVBench is an extreme long video understanding benchmark designed to evaluate multimodal models on videos up to two hours in duration. It contains 6 major categories and 21 subcategories, with videos averaging five times longer than existing datasets. The benchmark addresses applications requiring comprehension of extremely long videos.
MATH dataset contains 12,500 challenging competition mathematics problems from AMC 10, AMC 12, AIME, and other mathematics competitions. Each problem includes full step-by-step solutions and spans multiple difficulty levels (1-5) across seven mathematical subjects. This variant uses Chain-of-Thought prompting to encourage step-by-step reasoning.
MATH-500 is a subset of the MATH dataset containing 500 challenging competition mathematics problems from AMC 10, AMC 12, AIME, and other mathematics competitions. Each problem includes full step-by-step solutions and spans multiple difficulty levels across seven mathematical subjects including Prealgebra, Algebra, Number Theory, Counting and Probability, Geometry, Intermediate Algebra, and Precalculus.
MathArena Apex is a challenging math contest benchmark featuring the most difficult mathematical problems designed to test advanced reasoning and problem-solving abilities of AI models. It focuses on olympiad-level mathematics and complex multi-step mathematical reasoning.
MathVision evaluates multimodal mathematical reasoning on a full 3,040-example visual math test set.
Benchmark for LLMs and agents using real-world Model Context Protocol servers across location navigation, repository management, finance, 3D design, browser automation, and web search tasks.
Medical chronology extraction benchmark evaluating LLMs on structured timeline extraction from synthetic medical-legal records across six golden datasets and three generation rounds.
Medical and surgical video understanding benchmark for video large language models, covering 6,245 test samples across eight tasks including temporal action localization, spatiotemporal grounding, captioning, next-action prediction, CVS assessment, video summary, region captioning, and surgical skill assessment.
A comprehensive benchmark to evaluate expert-level medical knowledge and advanced reasoning, featuring 4,460 questions spanning 17 specialties and 11 body systems. Includes both text-only and multimodal subsets with expert-level exam questions incorporating diverse medical images and rich clinical information.
MLQA as part of the MEGA (Multilingual Evaluation of Generative AI) benchmark suite. A multi-way aligned extractive QA evaluation benchmark for cross-lingual question answering across 7 languages (English, Arabic, German, Spanish, Hindi, Vietnamese, and Simplified Chinese) with over 12K QA instances in English and 5K in each other language.
TyDi QA as part of the MEGA benchmark suite. A question answering dataset covering 11 typologically diverse languages (Arabic, Bengali, English, Finnish, Indonesian, Japanese, Korean, Russian, Swahili, Telugu, and Thai) with 204K question-answer pairs. Features realistic information-seeking questions written by people who want to know the answer but don't know it yet.
Universal Dependencies POS tagging as part of the MEGA benchmark suite. A multilingual part-of-speech tagging dataset based on Universal Dependencies treebanks, utilizing the universal POS tag set of 17 tags across 38 diverse languages from different language families. Used for evaluating multilingual POS tagging systems.
XCOPA (Cross-lingual Choice of Plausible Alternatives) as part of the MEGA benchmark suite. A typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages, including resource-poor languages like Eastern Apurímac Quechua and Haitian Creole. Requires models to select which choice is the effect or cause of a given premise.
XStoryCloze as part of the MEGA benchmark suite. A cross-lingual story completion task that consists of professional translations of the English StoryCloze dataset into 10 non-English languages. Requires models to predict the correct ending for a given four-sentence story, evaluating commonsense reasoning and narrative understanding.
MixEval Chat reports chat-model results for MixEval and MixEval-Hard, dynamic benchmark mixtures designed to approximate real-world user-facing LLM capability with strong correlation to Chatbot Arena.
A comprehensive benchmark for multi-task long video understanding that evaluates multimodal large language models on videos ranging from 3 minutes to 2 hours across 9 distinct tasks including reasoning, captioning, recognition, and summarization.
Benchmark for evaluating LLM proficiency with Apple's MLX machine learning framework across 520 questions, 11 categories, 6 question types, and 4 difficulty levels.
A multimodal web navigation benchmark comprising 2,000 open-ended tasks spanning 137 websites across 31 domains. Each task includes HTML documents paired with webpage screenshots, action sequences, and complex web interactions.
A multi-turn LLM-as-a-judge evaluation benchmark for testing multimodal instruction-tuned models' ability to follow user instructions in multi-turn dialogues and answer open-ended questions in a zero-shot manner.
A long-form multi-shot benchmark for holistic video understanding that incorporates approximately 600 web videos from YouTube spanning 16 major categories, with each video ranging from 30 seconds to 6 minutes. Includes roughly 2,000 original question-answer pairs covering 26 fine-grained capabilities.
Long-context multimodal document understanding benchmark evaluating vision-language and omni models on document comprehension accuracy.
Chain-of-Thought variant of the Massive Multitask Language Understanding benchmark, evaluating language models across 57 tasks including elementary mathematics, US history, computer science, law, and other professional and academic subjects. This version uses chain-of-thought prompting to elicit step-by-step reasoning.
Extended version of MMLU-Pro providing additional challenging multiple-choice questions for evaluating language models across diverse academic and professional domains. Built on the foundation of the Massive Multitask Language Understanding benchmark framework.
An improved version of the MMLU benchmark featuring manually re-annotated questions to identify and correct errors in the original dataset. Provides more reliable evaluation metrics for language models by addressing dataset quality issues found in the original MMLU.
STEM-focused subset of the Massive Multitask Language Understanding benchmark, evaluating language models on science, technology, engineering, and mathematics topics including physics, chemistry, mathematics, and other technical subjects.
MM-Vet is an evaluation benchmark that examines large multimodal models on complicated multimodal tasks requiring integrated capabilities. It assesses six core vision-language capabilities: recognition, knowledge, spatial awareness, language generation, OCR, and math through questions that require one or more of these capabilities.
MMVU (Multimodal Multi-disciplinary Video Understanding) is a benchmark for evaluating multimodal models on video understanding tasks across multiple disciplines, testing comprehension and reasoning capabilities on video content.
MobileMiniWob++ SR (Success Rate) is an adaptation of the MiniWob++ web interaction benchmark for mobile Android environments within AndroidWorld. It comprises 92 web interaction tasks adapted for touch-based mobile interfaces, evaluating agents' ability to navigate and interact with web applications on mobile devices.
Leaderboard for automatic speech recognition on Moroccan Darija, reporting word error rate and character error rate.
MRCR 1M is a variant of the Multi-Round Coreference Resolution benchmark designed for testing extremely long context capabilities with approximately 1 million tokens. It evaluates models' ability to maintain reasoning and attention across ultra-long conversations.
MRCR v2 (Multi-Round Coreference Resolution version 2) is an enhanced version of the synthetic long-context reasoning task. It extends the original MRCR framework with improved evaluation criteria and additional complexity for testing models' ability to maintain attention and reasoning across extended contexts.
MRCR v2 (8-needle) is a variant of the Multi-Round Coreference Resolution benchmark that includes 8 needle items to retrieve from long contexts. This tests models' ability to simultaneously track and reason about multiple pieces of information across extended conversations.
A comprehensive benchmark for robust multi-image understanding capabilities of multimodal LLMs. Consists of 12 diverse multi-image tasks involving 10 categories of multi-image relations (e.g., multiview, temporal relations, narrative, complementary). Comprises 11,264 images and 2,600 multiple-choice questions created in a pairwise manner, where each standard instance is paired with an unanswerable variant for reliable assessment.
Multi-IF benchmarks LLMs on multi-turn and multilingual instruction following. It expands upon IFEval by incorporating multi-turn sequences and translating English prompts into 7 other languages, resulting in 4,501 multilingual conversations with three turns each. The benchmark reveals that current leading LLMs struggle with maintaining accuracy in multi-turn instructions and shows higher error rates for non-Latin script languages.
A multilingual benchmark for issue resolving that evaluates Large Language Models' ability to resolve software issues across diverse programming ecosystems. Covers 7 programming languages (Java, TypeScript, JavaScript, Go, Rust, C, and C++) with 1,632 high-quality instances carefully annotated by 68 expert annotators. Addresses limitations of existing benchmarks that focus almost exclusively on Python.
MultiChallenge evaluates frontier LLMs on realistic multi-turn conversations, assessing instruction retention, inference memory, and self-coherence.
The Multilingual Grade School Math (MGSM) benchmark evaluates language models' chain-of-thought reasoning abilities across ten typologically diverse languages. It contains 250 grade-school math problems manually translated from the GSM8K dataset into languages including Bengali and Swahili.
MongoDB text-to-query benchmark evaluating natural-language generation of mongosh queries with execution, output, normalization, latency, and token metrics.
NaturalCodeBench (NCB) is a challenging code benchmark designed to mirror the complexity and variety of real-world coding tasks. It comprises 402 high-quality problems in Python and Java, selected from natural user queries from online coding services, covering 6 different domains.
Centrally scored long-context needle-retrieval benchmark on dense scientific paper text, with haystacks from 50K through 1M tokens.
Public leaderboard for proprietary command-following, distractor-resistance, expectation-breaking, poem, and stylized-writing tests run mainly on open-source LLM variants.
Nexus Function Calling: Evaluates tool calling, API use, function selection, structured arguments, and multi-step tool workflows.
NYU LLM CTF evaluates autonomous agents on a 200-challenge capture-the-flag benchmark covering crypto, forensics, misc, pwn, rev, and web tasks.
Pairwise VLM-as-judge OCR benchmark ranking OCR models on Encyclopaedia Britannica document images with Bradley-Terry Elo scores.
Pairwise VLM-as-judge OCR benchmark ranking OCR models on UFO document images with Bradley-Terry Elo scores.
OCRBench v2 evaluates large multimodal models on bilingual visual text localization and reasoning tasks.
OCRBench v2 English subset: Enhanced benchmark for evaluating Large Multimodal Models on visual text localization and reasoning with English text content.
OCRBench v2 Chinese subset: Enhanced benchmark for evaluating Large Multimodal Models on visual text localization and reasoning with Chinese text content.
OJBench is a code benchmark designed to assess the competition-level code reasoning abilities of large language models. It comprises 232 programming competition problems from NOI and ICPC, categorized into Easy, Medium, and Hard difficulty levels, and evaluates models' ability to solve them in Python and C++.
Document OCR benchmark from AllenAI for measuring OCR model quality on varied real-world document pages.
OlympiadBench evaluates bilingual olympiad-level mathematical and physics reasoning, including multimodal and text-only problem settings.
Open Agent Leaderboard math-reasoning track comparing prompting and agent algorithms across GSM8K, AQuA, and MATH-500 with score and cost metrics.
Open Agent Leaderboard multimodal track comparing visual-agent configurations by score, pass rate, and token usage.
OCR and data extraction leaderboard comparing traditional OCR providers and multimodal LLM systems on 1,000 pages.
OmniDocBench 1.5 is a comprehensive benchmark for evaluating multimodal large language models on document understanding tasks, including OCR, document parsing, information extraction, and visual question answering across diverse document types. Lower Overall Edit Distance scores are better.
Open Arabic LLM Leaderboard v2 evaluating Arabic and Arabic-interested language models across AlGhafa, ArabicMMLU, EXAMS, MadinahQA, AraTrust, ALRAGE, and ArbMMLU-HT.
Hugging Face Open ASR Leaderboard for speech recognition models across diverse public automatic speech recognition benchmarks.
BAAI leaderboard for Chinese-oriented LLM evaluation across C-ARC, C-HellaSwag, C-TruthfulQA, C-Winogrande, C-GSM8K, C-SEM, C-MMLU, and CLCC-H.
Italian LLM leaderboard evaluating open models on Italian M-MMLU, Belebele, HellaSwag, LAMBADA, XCOPA, and ARC tasks.
LLM-jp open Japanese LLM leaderboard evaluating Japanese and multilingual models across code generation, entity linking, factual association, historical events, commonsense, math, translation, NLI, QA, reading comprehension, and summarization.
Upstage Open Ko-LLM leaderboard evaluating Korean language model performance across translated reasoning, instruction following, safety, helpfulness, EQ, GSM8K, GPQA, Winogrande, and KorNAT tasks.
Open LLM Leaderboard v2 aggregates model evaluations across IFEval, BBH, MATH Level 5, GPQA, MuSR, and MMLU-PRO for open-weight language models.
Open Life Science AI leaderboard evaluating LLMs on medical QA and medical MMLU tasks, including PubMedQA, MedQA, MedMCQA, and six medical MMLU subjects.
Open Multilingual LLM Evaluation Leaderboard evaluates language models across non-English languages on translated ARC, HellaSwag, MMLU, and TruthfulQA tasks.
Portuguese LLM leaderboard evaluating models on ASSIN2 RTE, ASSIN2 STS, FaQuAD NLI, and HateBR offensive-language tasks.
Multi-dialect Arabic ASR leaderboard reporting average WER/CER and dataset-level WER/CER on SADA, Common Voice, MASC, MGB-2, and Casablanca.
Blind ADMET prediction challenge for pan-coronavirus drug-discovery data, evaluating submitted models across nine absorption, distribution, metabolism, excretion, and toxicity endpoints with a final blinded leaderboard.
Multi-round Co-reference Resolution (MRCR) benchmark for evaluating an LLM's ability to distinguish between multiple needles hidden in long context. Models are given a long, multi-turn synthetic conversation and must retrieve a specific instance of a repeated request, requiring reasoning and disambiguation skills beyond simple retrieval.
Multi-Round Co-reference Resolution benchmark that tests an LLM's ability to distinguish between multiple similar needles hidden in long conversations. Models must reproduce specific instances of content (e.g., 'Return the 2nd poem about tapirs') from multi-turn synthetic conversations, requiring reasoning about context, ordering, and subtle differences between similar outputs.
OpenBookQA is a question-answering dataset modeled after open book exams for assessing human understanding. It contains 5,957 multiple-choice elementary-level science questions that probe understanding of 1,326 core science facts and their application to novel situations, requiring combination of open book facts with broad common knowledge through multi-hop reasoning.
OpenClaw Arena configuration leaderboard measuring how different SOUL.md-style personal-agent configurations affect GPT-4.1 performance.
Personal AI agent benchmark evaluating frontier models across real-world OpenClaw-style tasks.
Holistic software-engineering agent benchmark from OpenHands covering issue resolution, multimodal bug fixing, app creation, test generation, and information gathering tasks.
OpenHuEval evaluates large language models on Hungarian-specific tasks, including real user queries, self-awareness, proverb reasoning, generative evaluation, and fill-in-the-blank tasks.
Public leaderboard tracking Uncensored General Intelligence and willingness-to-answer scores for AI models on undisclosed sensitive-topic evaluations.
Benchmark for MCP tool invocation in computer-use agents on OSWorld-style desktop tasks.
OSWorld-Verified is a verified subset of OSWorld, a scalable real computer environment for multimodal agents supporting task setup, execution-based evaluation, and interactive learning across Ubuntu, Windows, and macOS.
Competition-level math problems from OTIS Mock AIME evaluating olympiad-level math.
Benchmark for AI scientific paper writing quality using multi-LLM granular scoring, Lean4 formal verification, tribunal examination, inflation correction, and score-weighted peer voting.
Comprehensive LLM safety benchmark for jailbreak attacks, defense mechanisms, judges, and safety-capability tradeoffs, aggregating attack success rates and AlpacaEval capability scores by model and defense method.
Document parsing benchmark for AI agents over enterprise documents, evaluating tables, charts, content faithfulness, semantic formatting, and visual grounding.
A multimodal video benchmark designed to evaluate the perception and reasoning skills of pre-trained models across video, audio, and text modalities. It contains 11.6k real-world videos (average 23 seconds) filmed by participants worldwide and densely annotated with six types of labels. It focuses on skills (Memory, Abstraction, Physics, Semantics) and reasoning types (descriptive, explanatory, predictive, counterfactual), and shows a significant performance gap between the human baseline (91.4%) and state-of-the-art video QA models (46.2%).
PhiBench is an internal benchmark designed to evaluate diverse skills and reasoning abilities of language models, covering a wide range of tasks including coding (debugging, extending incomplete code, explaining code snippets) and mathematics (identifying proof errors, generating related problems). Created by Microsoft's research team to address limitations of standard academic benchmarks and guide the development of the Phi-4 model.
Physical AI Bench conditional generation leaderboard for controlled world-model generation under blur, edge, depth, segmentation, and combined conditioning settings.
Physical AI Bench generation leaderboard for world models predicting future states across autonomous driving, robotics, industry, human, physics, and common-sense scenarios.
Physical AI Bench understanding leaderboard for embodied physical reasoning over common sense, space, time, physics, robotics, autonomous driving, and video QA datasets.
PHYSICS is a comprehensive benchmark for university-level physics problem solving, containing 1,297 expert-annotated problems covering six core areas: classical mechanics, quantum mechanics, thermodynamics and statistical mechanics, electromagnetism, atomic physics, and optics. Each problem requires advanced physics knowledge and mathematical reasoning. Even advanced models like o3-mini achieve only 59.9% accuracy.
Leaderboard for personally identifiable information masking models across OpenPII, Gretel PII masking, Nemotron-PII, and Privy datasets, using average F2 as the primary metric.
Real-world OpenClaw agent benchmark evaluating how LLMs perform as the model inside an agent across practical coding, scheduling, research, email, and file-management workflows.
Polish Massive Text Embedding Benchmark leaderboard evaluating embedding models across Polish classification, clustering, pair classification, semantic textual similarity, and retrieval tasks.
Polymath is a challenging multi-modal mathematical reasoning benchmark designed to evaluate the general cognitive reasoning abilities of Multi-modal Large Language Models (MLLMs). The benchmark comprises 5,000 manually collected high-quality images of cognitive textual and visual challenges across 10 distinct categories, including pattern recognition, spatial reasoning, and relative reasoning.
PolyMath is a multilingual mathematical reasoning benchmark covering 18 languages and 4 difficulty levels from easy to hard, ensuring difficulty comprehensiveness, language diversity, and high-quality translation. The benchmark evaluates mathematical reasoning capabilities of large language models across diverse linguistic contexts, making it a highly discriminative multilingual mathematical benchmark.
Open community benchmark for local LLM inference on consumer, prosumer, and small-business hardware, measuring throughput, latency, power, long-context behavior, tool-call accuracy, answer quality, and memory-oriented tasks.
PopQA is an entity-centric open-domain question-answering dataset consisting of 14,000 QA pairs designed to evaluate language models' ability to memorize and recall factual knowledge across entities with varying popularity levels. The dataset probes both parametric memory (stored in model parameters) and non-parametric memory effectiveness. Questions cover 16 diverse relationship types and were created by sampling knowledge triples from Wikidata and converting them to natural language with templates, with a focus on long-tail entities to understand LMs' strengths and limitations in memorizing factual knowledge.
Benchmark measuring how well CLI agents can autonomously post-train small language models under a fixed H100 and 10-hour budget.
Professional Reasoning Bench Finance evaluates frontier LLMs on complex financial reasoning tasks including analysis, modeling, and decision-making.
PROBE protein-protein binding affinity estimation benchmark, reporting mean squared error for representation methods with available affinity evaluations.
PROBE drug-target protein family classification benchmark, reporting MCC across random and sequence-identity split settings.
PROBE ontology-based protein function prediction benchmark, reporting F1 across molecular function, biological process, and cellular component targets.
PROBE semantic similarity benchmark for protein representation methods, measuring correlation with Gene Ontology molecular function, biological process, and cellular component similarities.
Professional Reasoning Bench Legal evaluates frontier LLMs on complex legal reasoning tasks drawn from real-world legal practice and case analysis.
End-to-end project development benchmark evaluating coding agents on complete executable software repository construction from high-level specifications.
QASPER is a dataset of 5,049 information-seeking questions and answers anchored in 1,585 NLP research papers. Questions are written by NLP practitioners who read only titles and abstracts; answers are written by a separate set of NLP practitioners who read the full paper text and provide supporting evidence. The dataset challenges models with complex reasoning across document sections for academic document question answering.
QMSum is a benchmark for query-based multi-domain meeting summarization consisting of 1,808 query-summary pairs over 232 meetings across academic, product, and committee domains. The dataset enables models to select and summarize relevant spans of meetings in response to specific queries. Published at NAACL 2021, QMSum presents significant challenges in long meeting summarization where models must identify and summarize relevant content based on user queries.
RefCOCO-avg measures object grounding accuracy averaged across RefCOCO, RefCOCO+, and RefCOCOg benchmarks.
RefSpatialBench evaluates spatial reference understanding and grounding.
Reward model benchmark evaluating preference models across chat, hard chat, safety, reasoning, and prior preference-evaluation sets.
RubricEval is a scalable framework for evaluating instruction-following models on open-ended tasks using example-specific human-authored rubrics and GPT-4o grading.
RULER v1 is a synthetic long-context benchmark for measuring how model quality degrades as input length increases. This packaging follows the public standalone NVIDIA RULER implementation with 13 official tasks spanning retrieval, multi-hop tracing, aggregation, and QA.
SAFE Audio Challenge Task 1 evaluates systems for detecting generated audio against pristine audio sources.
SAFE Audio Challenge Task 2 evaluates generated audio detection systems with additional augmentation-source robustness metrics.
SAFE Audio Challenge Task 3 evaluates generated audio detection systems on the third public SAFE challenge task split.
SciPredict benchmarks LLMs on forecasting the outcomes of real scientific experiments across biology, chemistry, and physics.
GUI grounding benchmark for professional high-resolution computer-use settings across development, creative, CAD, scientific, office, and operating-system applications.
Standardized leaderboard for search-augmented question-answering agents across general QA, multi-hop QA, and the closed-world FictionalHot benchmark.
SEED-Bench: Evaluates multimodal understanding across image, text, chart, diagram, or cross-modal reasoning tasks.
SEED-Bench-2: Evaluates multimodal understanding across image, text, chart, diagram, or cross-modal reasoning tasks.
Embedding retrieval leaderboard for rabbinic and Jewish textual bitext alignment across Talmud, Jerusalem Talmud, Mishnah, Midrash Rabbah, Tanakh commentary, Hasidic/Kabbalistic texts, Halacha, Philosophy, Targum, and Mussar/Ethics.
Seneca-TRBench evaluates LLM Turkish language proficiency with MCQ structural-linguistics questions and GPT-4o-judged short-answer tasks.
Multiple-choice benchmark of simple-looking reasoning questions designed so unspecialized humans outperform current frontier models.
SkillsBench evaluates coding agents on self-contained programming tasks, measuring practical engineering skills across diverse software development scenarios.
A semantically-labeled knowledge-enhanced dataset for medical visual question answering. Contains 642 radiology images (CT scans, MRI scans, X-rays) covering five body parts and 14,028 bilingual English-Chinese question-answer pairs annotated by experienced physicians. Features comprehensive semantic labels and a structural medical knowledge base with both vision-only and knowledge-based questions requiring external medical knowledge reasoning.
The first large-scale benchmark for commonsense reasoning about social situations. Contains 38,000 multiple-choice questions probing emotional and social intelligence in everyday situations, testing commonsense understanding of social interactions and theory-of-mind reasoning about the implied emotions and behavior of others.
Crowd-sourced blind listening leaderboard for text-to-speech systems, ranking voices by pairwise human preference outcomes.
SQuALITY (Summarization-format QUestion Answering with Long Input Texts, Yes!) is a long-document summarization dataset built by hiring highly qualified contractors to read public-domain short stories (3,000-6,000 words) and write original summaries from scratch. Each document has five summaries: one overview and four question-focused summaries. Designed to address limitations in existing summarization datasets by providing high-quality, faithful summaries.
StableToolBench evaluates LLM tool-use systems on solvable tool-query tasks, reporting pass-rate and win-rate scores across instruction, category, and tool subsets.
Leaderboard benchmarking LLM stability in simulated populations and roleplay settings, with ordinal, cardinal, rank-order stability, and structural fit metrics.
SOB evaluates how accurately language models produce schema-compliant and value-correct JSON from normalized text contexts spanning text QA, OCR-derived documents, and meeting transcripts.
SummScreenFD is the ForeverDreaming subset of the SummScreen dataset for abstractive screenplay summarization, comprising pairs of TV series transcripts and human-written recaps from 88 different shows. The dataset provides a challenging testbed for abstractive summarization where plot details are often expressed indirectly in character dialogues and scattered across the entirety of the transcript, requiring models to find and integrate these details to form succinct plot descriptions.
SWE Atlas Codebase QnA evaluates LLMs on deep code comprehension and question answering across real-world software repositories.
SWE Atlas Refactoring evaluates coding agents on restructuring code while preserving behavior across real-world software repositories.
SWE Atlas Test Writing evaluates coding agents on writing production-grade tests for specific behaviors in real-world software repositories.
Scale AI's professional software engineering benchmark extending SWE-bench-style issue resolution tasks.
Bash-only variant of SWE-bench Verified for real-world GitHub issue resolution.
Continuously evolving, decontaminated software engineering benchmark built from real GitHub pull requests for evaluating coding agents.
t2-bench is a benchmark for evaluating agentic tool use capabilities, measuring how well models can select, sequence, and utilize tools to solve complex tasks. It tests autonomous planning and execution in multi-step scenarios.
TabArena ranks tabular machine learning systems across all datasets and tasks; this snapshot uses the primary no-imputation, all-repeats, all-tasks leaderboard view.
TAU2 airline domain benchmark for evaluating conversational agents in dual-control environments where both AI agents and users interact with tools in airline customer service scenarios. Tests agent coordination, communication, and ability to guide user actions in tasks like flight booking, modifications, cancellations, and refunds.
TAU2-bench customer-service benchmark variant for retail workflows, measuring agents across multi-turn tool-use tasks.
TAU3-Bench is a benchmark for evaluating general-purpose agent capabilities, testing models on multi-turn interactions with simulated user models, retrieval, and complex decision-making scenarios.
TempCompass is a comprehensive benchmark for evaluating temporal perception capabilities of Video Large Language Models (Video LLMs). It constructs conflicting videos that share identical static content but differ in specific temporal aspects to prevent models from exploiting single-frame bias. The benchmark evaluates multiple temporal aspects including action, motion, speed, temporal order, and attribute changes across diverse task formats including multi-choice QA, yes/no QA, caption matching, and caption generation.
TemporalBench evaluates LLM-based agents on contextual and event-informed time-series tasks spanning multiple datasets and task types.
Terminal and command-line interaction tasks for evaluating agent performance.
TextClass Benchmark evaluates LLMs and transformers for social-science text classification across multiple domains and languages, reporting domain-specific Elo leaderboards and a weighted Meta-Elo aggregate.
Multi-step workplace automation benchmark for autonomous agents.
A theorem-driven question answering dataset containing 800 high-quality questions covering 350+ theorems from Math, Physics, EE&CS, and Finance. Designed to evaluate AI models' capabilities to apply theorems to solve challenging university-level science problems.
The timm leaderboard ranks image classification models across ImageNet and robustness variants including ImageNet-ReaL, ImageNetV2, ImageNet-Sketch, and ImageNet-R.
TOFU evaluates machine unlearning for large language models on fictitious author QA data; this leaderboard variant reports LLaMA submissions at the 1% forget-set setting.
TOFU evaluates machine unlearning for large language models on fictitious author QA data; this leaderboard variant reports LLaMA submissions at the 10% forget-set setting.
TOFU evaluates machine unlearning for large language models on fictitious author QA data; this leaderboard variant reports LLaMA submissions at the 5% forget-set setting.
TOFU evaluates machine unlearning for large language models on fictitious author QA data; this leaderboard variant reports Phi submissions at the 1% forget-set setting.
TOFU evaluates machine unlearning for large language models on fictitious author QA data; this leaderboard variant reports Phi submissions at the 10% forget-set setting.
TOFU evaluates machine unlearning for large language models on fictitious author QA data; this leaderboard variant reports Phi submissions at the 5% forget-set setting.
Tool Decathlon is a comprehensive benchmark for evaluating AI agents' ability to use multiple tools across diverse task categories. It measures proficiency in tool selection, sequencing, and execution across ten different tool-use scenarios.
COMET-22 is an ensemble machine translation evaluation metric combining a COMET estimator model trained with Direct Assessments and a multitask model that predicts sentence-level scores and word-level OK/BAD tags. It demonstrates improved correlations compared to state-of-the-art metrics and increased robustness to critical errors.
Translation evaluation using spBLEU (SentencePiece BLEU), a BLEU metric computed over text tokenized with a language-agnostic SentencePiece subword model. Introduced in the FLORES-101 evaluation benchmark for low-resource and multilingual machine translation.
COMET-22 is a neural machine translation evaluation metric that uses an ensemble of two models: a COMET estimator trained with Direct Assessments and a multitask model that predicts sentence-level scores and word-level OK/BAD tags. It provides improved correlations with human judgments and increased robustness to critical errors compared to previous metrics.
spBLEU (SentencePiece BLEU) evaluation metric for machine translation quality assessment, using language-agnostic SentencePiece tokenization with BLEU scoring. Part of the FLORES-101 evaluation benchmark for low-resource and multilingual machine translation.
A large-scale reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents (six per question on average) that provide high quality distant supervision for answering the questions. The dataset features relatively complex, compositional questions with considerable syntactic and lexical variability, requiring cross-sentence reasoning to find answers.
TruthfulQA is a benchmark to measure whether language models are truthful in generating answers to questions. It comprises 817 questions that span 38 categories, including health, law, finance and politics. The questions are crafted such that some humans would answer falsely due to a false belief or misconception, testing models' ability to avoid generating false answers learned from human texts.
TuRTLe leaderboard variant for RTL code completion evaluated with Icarus Verilog across VerilogEval MC and VeriGen.
TuRTLe leaderboard variant for RTL code completion evaluated with Verilator across VerilogEval MC and VeriGen.
TuRTLe line-completion leaderboard on RTL-Repo using exact match as the primary metric.
TuRTLe module-completion leaderboard on the NotSoTiny benchmark, using formal equivalence and partition coverage over Tiny Tapeout-derived RTL modules.
TuRTLe leaderboard variant for spec-to-RTL generation evaluated with Icarus Verilog across VerilogEval S2R and RTLLM.
TuRTLe leaderboard variant for spec-to-RTL generation evaluated with Verilator across VerilogEval S2R and RTLLM.
TutorBench evaluates how well LLMs perform common tutoring tasks for high school and AP-level subjects.
Physically grounded benchmark for autonomous and agentic AI UAV systems, with 50,000 validated flight scenarios and 50,000 multiple-choice UAV reasoning questions spanning navigation, safety, policy, cyber-physical security, ethics, energy, and hybrid reasoning.
Leaderboard for Ukrainian-language LLM evaluation across translated and native tasks including MMLU, FLORES, SQuAD, ARC, GSM8K, IFEval, WMT, and ZNO.
Unified semantic evaluation benchmark for text-to-image generation across style, world knowledge, attributes, actions, relationships, compound prompts, grammar, layout, reasoning, and text.
Long-prompt English variant of UniGenBench for semantic evaluation of text-to-image generation across style, world knowledge, attributes, actions, relationships, grammar, layout, reasoning, and text.
Unified reasoning-based image editing benchmark covering real-world edits and game-world reasoning tasks.
URIAL Bench evaluates base language models prompted with Untuned LLMs with Restyled In-context ALignment on MT-Bench-style multi-turn tasks.
The 2025 United States of America Mathematical Olympiad (USAMO) benchmark consists of six challenging mathematical problems requiring rigorous proof-based reasoning. USAMO is the most prestigious high school mathematics competition in the United States, serving as the final round of the American Mathematics Competitions series. This benchmark evaluates models on mathematical problem-solving capabilities beyond simple numerical computation, focusing on formal mathematical reasoning and proof generation.
VaTeX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research. Contains over 41,250 videos and 825,000 captions in both English and Chinese, with over 206,000 English-Chinese parallel translation pairs. Supports multilingual video captioning and video-guided machine translation tasks.
VBVR-Bench evaluates video generation models on visual behavior and video reasoning across in-domain and out-of-domain abstraction, knowledge, perception, spatial, and transition categories.
Leaderboard using Vectara's Hughes Hallucination Evaluation Model to measure hallucination and factual consistency in document summarization.
Open coding-agent benchmark harness comparing agent resolution rate, cost, and unique wins on a curated 100-task subset of SWE-bench Verified.
VIBE-Eval is a hard evaluation suite for measuring progress of multimodal language models, consisting of 269 visual understanding prompts with gold-standard responses authored by experts. The benchmark has dual objectives: vibe checking multimodal chat models for day-to-day tasks and rigorously testing frontier models, with more than 50% of the questions in the hard set answered incorrectly by all frontier models.
VIBE-Pro is an advanced version of the VIBE (Visual & Interactive Benchmark for Execution) benchmark that evaluates LLMs on professional-grade full-stack application development tasks. It measures model performance across complex real-world development scenarios including web, mobile, and backend applications with higher difficulty than the standard VIBE benchmark.
Production-oriented coding benchmark evaluating AI coding agents across functional correctness, visual fidelity, code quality, security, cost, and speed on representative developer tasks.
Video SimpleQA evaluates factual grounding in large video language models with short-form, multi-hop, temporally grounded video questions.
Video understanding benchmark with 800 videos and 3,200 multiple-choice QA items spanning retrieval, temporal understanding, and complex reasoning.
The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis. Features 900 videos (254 hours) with 2,700 question-answer pairs covering 6 primary visual domains and 30 subfields. Evaluates temporal understanding across short (11 seconds) to long (1 hour) videos with multi-modal inputs including video frames, subtitles, and audio.
Video-MME is a comprehensive evaluation benchmark for multi-modal large language models in video analysis. It features 900 videos across 6 primary visual domains with 30 subfields, ranging from 11 seconds to 1 hour in duration, with 2,700 question-answer pairs. The benchmark evaluates MLLMs' capabilities in processing sequential visual data and multi-modal content including video frames, subtitles, and audio.
Video-MMMU evaluates Large Multimodal Models' ability to acquire knowledge from expert-level professional videos across six disciplines through three cognitive stages: perception, comprehension, and adaptation. Contains 300 videos and 900 human-annotated questions spanning Art, Business, Science, Medicine, Humanities, and Engineering.
Visual Document Retrieval Benchmark V3 pipeline leaderboard for English-only full retrieval systems, including NDCG@5 and latency metrics.
VisIT-Bench Multiple Images ranks vision-language models with human-preference Elo scores on instruction-following tasks over multiple images.
VisIT-Bench Single Image ranks vision-language models with human-preference Elo scores on instruction-following tasks over single images.
Scale’s SEAL Leaderboard evaluates top models’ visual-language understanding, testing perception, logic, calculation, and common sense.
BrowserGym leaderboard slice for VisualWebArena, evaluating web agents on visually grounded browser tasks under the BrowserGym submission protocol.
A multimodal benchmark designed to assess the capabilities of multimodal large language models (MLLMs) across web page understanding and grounding tasks. Comprises 7 tasks (captioning, webpage QA, heading OCR, element OCR, element grounding, action prediction, and action grounding) with 1.5K human-curated instances from 139 real websites across 87 sub-domains.
A vision-language benchmark that probes blind spots and brittle reasoning in multimodal models.
Leaderboard for Japanese visual-novel translation into English, ranking LLMs and translation systems by semantic similarity accuracy over 256 translation samples, with chrF reported as an auxiliary metric.
Open real-world latency leaderboard for voice AI platforms, ranking providers by measured end-to-end response latency from recurring automated calls.
Visual Similarity Dataset fashion retrieval benchmark for in-catalog zero-shot retrieval, reporting ROC AUC and MRR@5 for paper baseline models.
BrowserGym leaderboard slice for WebLINX, evaluating web agents under the BrowserGym result submission protocol.
Human-annotated web main-content extraction benchmark evaluating extractors and model-backed pipelines on full-page ROUGE-N F1 plus fine-grained text, code, formula, and table metrics.
WHOOPS evaluates vision-language models on commonsense-defying images; this leaderboard snapshot tracks the explanation-of-violation human metric from the public WHOOPS full leaderboard.
WideSearch is an agentic search benchmark that evaluates models' ability to perform broad, parallel search operations across multiple sources. It tests wide-coverage information retrieval and synthesis capabilities.
End-to-end AI agent benchmark with 60 original tasks in a live OpenClaw environment spanning productivity, code intelligence, social interaction, search, creative synthesis, and safety alignment workflows.
Large-scale Winograd schema challenge for commonsense pronoun-resolution reasoning.
The Eighth Conference on Machine Translation (WMT23) benchmark evaluating machine translation systems across 8 language pairs (14 translation directions) including general, biomedical, literary, and low-resource language translation tasks. Features specialized shared tasks for quality estimation, metrics evaluation, sign language translation, and discourse-level literary translation with professional human assessment.
WMT24++ is a comprehensive multilingual machine translation benchmark that expands the WMT24 dataset to cover 55 languages and dialects. It includes human-written references and post-edits across four domains (literary, news, social, and speech) to evaluate machine translation systems and large language models across diverse linguistic contexts.
BrowserGym leaderboard slice for WorkArena-L1, evaluating web agents on atomic ServiceNow knowledge-work tasks.
BrowserGym leaderboard slice for WorkArena-L2, evaluating web agents on compositional ServiceNow knowledge-work tasks.
BrowserGym leaderboard slice for WorkArena-L3, evaluating web agents on harder compositional ServiceNow knowledge-work tasks.
WorldArena leaderboard track for embodied world models, aggregating perception, dynamics, consistency, physics, 3D accuracy, and controllability metrics.
WorldArena leaderboard track for embodied world models on data-engine and action-planner functional utility tasks.
Leaderboard for WorldScore, a benchmark for world generation systems across video, 3D, and 4D settings with static, dynamic, camera/object control, consistency, style, subjective quality, and motion metrics.
A comprehensive benchmark for evaluating large language models' generative writing capabilities across 6 core writing domains (Academic & Engineering, Finance & Business, Politics & Law, Literature & Art, Education, Advertising & Marketing) and 100 subdomains. Contains 1,239 queries with a query-dependent evaluation framework that dynamically generates 5 instance-specific assessment criteria for each writing task, using a fine-tuned critic model to score responses on style, format, and length dimensions.
Extreme-risks evaluation leaderboard for frontier models, using 3C3H scoring over biology, chemistry, and cybersecurity risk-domain questions.
XSTest is a test suite designed to identify exaggerated safety behaviours in large language models. It comprises 450 prompts: 250 safe prompts across ten prompt types that well-calibrated models should not refuse to comply with, and 200 unsafe prompts as contrasts that models should refuse. The benchmark systematically evaluates whether models refuse to respond to clearly safe prompts due to overly cautious safety mechanisms.
Yet Another LLM Leaderboard snapshot for the Nous benchmark suite, aggregating public AGIEval, GPT4All, TruthfulQA, and Bigbench scores for open LLMs.
ZClawBench evaluates Claw-style agent task execution quality, measuring a model's ability to autonomously complete complex multi-step coding tasks in real-world environments.
ZebraLogic evaluates models on grid-style zebra logic puzzles, reporting exact puzzle accuracy and cell-level accuracy across difficulty and puzzle sizes.
ZEROBench-Sub is a subset of the ZEROBench benchmark.
BigCodeBench-Hard evaluates code generation on the harder BigCodeBench subset, reporting pass@1 in complete and instruct settings.
BrowseComp: Evaluates autonomous agent performance on multi-step tasks requiring planning, state tracking, tool use, and recovery.
A stateful enterprise operations benchmark for evaluating LLM agents on long-horizon planning, tool use, and policy-governed workflows.
Google DeepMind and Google Research benchmark for long-form factuality and grounding against provided document context up to 32k tokens.
MATH Level 5: Measures mathematical reasoning, symbolic problem solving, proof construction, or competition-style problem solving.
A benchmark where software engineering agents rebuild complete programs from compiled binaries and documentation, then are scored against hidden behavioral tests.
VisualWebArena: Measures browser, desktop, mobile, or GUI agents operating in interactive environments.
Private question-answer benchmark over Canadian court cases.
A benchmark of expert-validated tasks for agents that learn and improve across sequences of task instances rather than solving independent tasks from scratch.
Evaluating agents on core financial analyst tasks
Anthropic BioMysteryBench slice covering 23 real-world bioinformatics tasks no human benchmarker solved after QC, evaluated by average accuracy across five trials per problem.
Anthropic BioMysteryBench slice covering 76 real-world bioinformatics tasks solved by at least one human benchmarker, evaluated by average accuracy across five trials per problem.
Speech-to-text benchmark for long-form medical dialogue, ranking cloud and local transcription systems with Medical Word Error Rate on the PriMock57 dataset.
Data-science agent benchmark evaluating whether LLM agents solve real-data analysis tasks correctly and robustly across correctness, code quality, efficiency, and statistical validity.
OpenAI internal expansion of hard cybersecurity capture-the-flag challenge tasks used in system cards.
OpenAI internal frontier software engineering evaluation for long-horizon coding tasks with a median estimated human completion time of 20 hours.
Graphwalks breadth-first-search long-context reasoning task reported at 1M context with F1 scoring.
Graphwalks breadth-first-search long-context reasoning task reported at 256k context with F1 scoring.
Graphwalks parent-node long-context reasoning task reported at 1M context with F1 scoring.
Graphwalks parent-node long-context reasoning task reported at 256k context with F1 scoring.
OpenAI internal evaluation for investment-banking modeling tasks.
OpenAI launch-post benchmark for professional office question-answering tasks.
OpenAI MRCR v2 8-needle long-context retrieval variant for the 128K-256K context range.
OpenAI MRCR v2 8-needle long-context retrieval variant for the 16K-32K context range.
OpenAI MRCR v2 8-needle long-context retrieval variant for the 256K-512K context range.
OpenAI MRCR v2 8-needle long-context retrieval variant for the 32K-64K context range.
OpenAI MRCR v2 8-needle long-context retrieval variant for the 4K-8K context range.
OpenAI MRCR v2 8-needle long-context retrieval variant for the 512K-1M context range.
OpenAI MRCR v2 8-needle long-context retrieval variant for the 64K-128K context range.
OpenAI MRCR v2 8-needle long-context retrieval variant for the 8K-16K context range.
UC Berkeley EPIC Data Lab benchmark for data agents answering complex real-world data tasks across 12 datasets, 9 domains, and multiple database systems.
Safety-first clinical SOAP note generation benchmark measuring groundedness, hallucinations, coverage, and note quality across 300 doctor-patient dialogues.
Telecommunications-domain model leaderboard across TeleQnA, TeleTables, ORANBench, srsRANBench, TeleMath, TeleLogs, and 3GPP pattern-matching benchmarks.
SWE-bench extension with 300 software issue tasks spanning 9 programming languages.
Which model can make the most money playing poker?
Original SWE-bench leaderboard over 2,294 real GitHub issue resolution tasks.
Function-calling edition of AgentBench, evaluating LLM agents on ALFWorld, database, knowledge graph, operating-system, and WebShop environments using pass@1 success rates.
SWE-bench extension with 517 software issues that include visual elements such as screenshots, mockups, diagrams, and visual error context.
Curated 300-instance SWE-bench subset for lower-cost evaluation of issue-resolving agents.
Real freelance software engineering tasks from Upwork, scored by end-to-end tests and payout value.
Benchmark suite for long-context software engineering tasks including library-based code generation, CI build repair, commit message generation, bug localization, and module summarization.
End-to-end replication of state-of-the-art AI papers, graded against hierarchical rubrics.
Human annotated subset of SWE-bench with 500 verified software engineering tasks.
Hugging Face Optimum Benchmark performance leaderboard for LLM inference configurations across PyTorch CUDA, CPU, OpenVINO, ONNX Runtime, quantization schemes, and hardware profiles.
Dynamic reasoning benchmark grounded in computational complexity classes, evaluating LLMs on P, NP-complete, and NP-hard algorithmic tasks with weighted accuracy.