Benchmark Market Map
AI benchmarks grouped by the ability they test and the rough task shape they use.
- Benchmarks: 1,002
- Known tasks: 311,714
Ability
Coding Agents
Software engineering tasks where systems inspect repositories, edit code, run tools, and satisfy verifiers.
Issue Resolution
22 benchmarks. Bug fixes, PR-style tasks, SWE-bench-style repair.
Repo Context
42 benchmarks. Repository understanding, codebase QA, retrieval, and localization.
Project Building
8 benchmarks. Full apps, larger feature builds, and end-to-end implementation.
Terminal / DevOps
6 benchmarks. Command-line, shell, container, and operational coding tasks.
Kernel / Perf
11 benchmarks. GPU kernels, optimization, and low-level performance work.
Code Generation
52 benchmarks. Programming contests, synthesis, completion, and standalone code tasks.
Ability
Web + Computer Use
Agents operating browsers, GUIs, devices, apps, or enterprise software surfaces.
Browser / Web
31 benchmarks. Website navigation, web workflows, and browser control.
Desktop / OS
21 benchmarks. Desktop environments, operating systems, and GUI apps.
Enterprise Apps
4 benchmarks. CRM, office, airline, retail, telecom, and workplace apps.
Automation
10 benchmarks. Tool-driven app automation and workflow execution.
Visual UI
78 benchmarks. Screen grounding, app UI understanding, and visual web tasks.
Ability
Long Context + Memory
Benchmarks stressing retrieval, context windows, evidence use, temporal state, and durable agent memory.
Document Retrieval
59 benchmarks. RAG and document-grounded answering.
Trajectory Memory
6 benchmarks. Remembering long agent sessions, traces, and user histories.
Long-Context QA
15 benchmarks. Needles, long documents, books, and massive-context question answering.
Temporal State
6 benchmarks. Time, chronology, state tracking, and evolving facts.
Knowledge Recall
0 benchmarks. Open-ended recall and grounded factual memory.
No mapped benchmarks yet.
Ability
Professional Workflows
Domain work where correctness depends on professional conventions, documents, and expert judgment.
Legal + Patents
13 benchmarks. Legal reasoning, litigation, contracts, patents, and regulation.
Finance + Business
27 benchmarks. Financial analysis, spreadsheets, investment, tax, and business documents.
Medical + Health
28 benchmarks. Clinical, biomedical, health, and care workflow tasks.
Enterprise Ops
20 benchmarks. Operational business workflows and internal process execution.
Research + Science
41 benchmarks. Scientific work, literature, experiments, and research assistance.
Ability
Multimodal Understanding
Visual, audio, video, spatial, and physical-world perception and reasoning.
OCR + Docs
27 benchmarks. Document AI, OCR, forms, PDFs, and screenshots.
Charts + Tables
12 benchmarks. Charts, plots, tables, and structured visual data.
Image + Spatial
29 benchmarks. Images, visual QA, localization, and spatial reasoning.
Video + Audio
30 benchmarks. Video, speech, sound, and audiovisual understanding.
Robotics / Physical
4 benchmarks. Embodied, robotics, manipulation, and physical-world tasks.
Ability
Safety, Security + Trust
Adversarial behavior, misuse, cyber, privacy, robustness, hallucination, and trustworthiness.
Jailbreaks / Misuse
60 benchmarks. Harmful requests, jailbreaks, policy violations, and refusal behavior.
Cyber Security
17 benchmarks. CTFs, exploits, secure coding, vulnerabilities, and CWE coverage.
Privacy + PII
2 benchmarks. PII detection, masking, leakage, and privacy preservation.
Hallucination + Truth
43 benchmarks. Factuality, grounding, hallucination, and truthfulness.
Governance + Fairness
13 benchmarks. Bias, fairness, compliance, trust, and alignment checks.
Ability
Tools, Data + Structured Work
Using APIs, databases, tools, spreadsheets, schemas, and structured outputs.
SQL + Data
9 benchmarks. SQL, databases, analytics, tables, and data agent tasks.
Function Calling
59 benchmarks. API use, tool calling, and structured tool selection.
Search + RAG
4 benchmarks. Search, browsing, evidence retrieval, and answer synthesis.
Docs + Sheets
1 benchmark. Documents, spreadsheets, forms, and office-style structured work.
MCP / APIs
4 benchmarks. MCP servers, API ecosystems, and external tool environments.
Ability
Reasoning + Knowledge
Core model capability tests across math, science, knowledge, language, logic, and instruction following.
Math
30 benchmarks. Math competitions, arithmetic, proofs, and quantitative reasoning.
Science
53 benchmarks. Science QA, physics, biology, chemistry, and technical exams.
Exams + Knowledge
65 benchmarks. General exams, knowledge QA, and broad capability tests.
Logic + Planning
24 benchmarks. Puzzles, planning, symbolic reasoning, and hard reasoning.
Language + Multilingual
16 benchmarks. Language understanding, translation, multilingual, and writing tasks.