GPT-4
GPT / OpenAI
141 scores
81 benchmarks
$30 / $60 per 1M tokens (cost in/out)
Metadata
GPT Closed/API
Aliases: gpt-4, openai-gpt-4, openai/gpt-4
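
The listed price of $30 / $60 per 1M input/output tokens can be turned into a rough per-request estimate. The sketch below is illustrative only: the helper name `estimate_cost_usd` and the example token counts are assumptions, not part of any official API, and actual billing may differ.

```python
# Prices from the listing above, in USD per 1M tokens.
INPUT_USD_PER_M = 30.0
OUTPUT_USD_PER_M = 60.0


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request (hypothetical helper)."""
    total = input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    return total / 1_000_000


# Example: a 2,000-token prompt with a 500-token completion (made-up sizes).
print(f"${estimate_cost_usd(2_000, 500):.2f}")  # prints "$0.09"
```
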
| Benchmark | Category | Rank | Score | Sampled |
|---|---|---|---|---|
| Clembench Multimodal v1.6.5 | Agentic | 3 | 73.55 | 2026-05-06 |
| MLAgentBench | Agentic | 4 | 19.2% | 2026-05-27 |
| Nexus Function Calling | Agentic | 2 | 54.18 | 2026-05-06 |
| OmniACT | Agentic | 2 | 17.02 | 2026-05-27 |
| OmniACT | Agentic | 3 | 11.6 | 2026-05-27 |
| ScreenSpot | Agentic | 4 | 16.2% | 2026-05-27 |
| ToolSandbox | Agentic | 4 | 64.3 | 2026-05-27 |
| RewardBench | Alignment | 51 | 84.34 | 2026-05-06 |
| TextClass Benchmark | Classification | 10 | 1747.59 | 2026-05-06 |
| Aider Refactoring Benchmark | Coding | 6 | 50.60 | 2026-05-06 |
| Aider Refactoring Benchmark | Coding | 11 | 33.70 | 2026-05-06 |
| BigCodeBench | Coding | 22 | 46 | 2026-05-06 |
| ClassEval | Coding | 1 | 37.6 | 2026-05-27 |
| ClassEval | Coding | 3 | 29.6 | 2026-05-27 |
| ClassEval | Coding | 4 | 26.2 | 2026-05-27 |
| CodeEditorBench | Coding | 1 | 0.882 | 2026-05-27 |
| CodeEditorBench | Coding | 2 | 0.868 | 2026-05-27 |
| CodeEditorBench | Coding | 3 | 0.855 | 2026-05-27 |
| CodeEditorBench | Coding | 5 | 0.85 | 2026-05-27 |
| CodeEditorBench | Coding | 6 | 0.816 | 2026-05-27 |
| CodeEditorBench | Coding | 10 | 0.8 | 2026-05-27 |
| CRUXEval | Coding | 3 | 76.30 | 2026-05-05 |
| CRUXEval | Coding | 5 | 69.25 | 2026-05-05 |
| DS-1000 | Coding | 2 | 0.51 | 2026-05-27 |
| ENAMEL | Coding | 4 | 0.45 | 2026-05-06 |
| HumanEval+ | Coding | 14 | 79.30 | 2026-05-05 |
| Spider | Data | 2 | 86.60 | 2026-05-06 |
| Spider | Data | 3 | 86.20 | 2026-05-06 |
| Spider | Data | 4 | 85.60 | 2026-05-06 |
| Spider | Data | 6 | 83.90 | 2026-05-06 |
| Spider | Data | 8 | 80.80 | 2026-05-06 |
| MMDocBench | Document Understanding | 8 | 61.93% | 2026-05-27 |
| GSMA Open Telco Leaderboard | Domain | 48 | 48.58 | 2026-05-06 |
| FinanceBench | Finance | 1 | 89.33 | 2026-05-06 |
| FinanceBench | Finance | 2 | 85.33 | 2026-05-06 |
| FinanceBench | Finance | 3 | 84 | 2026-05-06 |
| FinanceBench | Finance | 4 | 78.67 | 2026-05-06 |
| FinanceBench | Finance | 5 | 78.67 | 2026-05-06 |
| FinanceBench | Finance | 7 | 50 | 2026-05-06 |
| FinanceBench | Finance | 8 | 42 | 2026-05-06 |
| FinanceBench | Finance | 11 | 24.67 | 2026-05-06 |
| FinanceBench | Finance | 12 | 19.33 | 2026-05-06 |
| FinanceBench | Finance | 14 | 16.67 | 2026-05-06 |
| FinanceBench | Finance | 15 | 9.33 | 2026-05-06 |
| FinanceBench | Finance | 16 | 4.67 | 2026-05-06 |
| FinBen | Finance | 1 | 28.19% | 2026-05-27 |
| INVESTORBENCH | Finance | 2 | 43.696% | 2026-05-27 |
| Open FinLLM Leaderboard | Finance | 2 | 48.337138% | 2026-05-27 |
| AlpacaEval | Generalization | 7 | 89.85849210429464 | 2026-05-27 |
| AlpacaEval | Generalization | 16 | 86.51018625518144 | 2026-05-27 |
| AlpacaEval | Generalization | 19 | 85.334647371383 | 2026-05-27 |
| AlpacaEval | Generalization | 31 | 81.38159399734118 | 2026-05-27 |
| AlpacaEval | Generalization | 90 | 44.09937888 | 2026-05-27 |
| CyberBench | Generalization | 1 | 69.6% | 2026-05-28 |
| CyberSecEval | Generalization | 2 | 19.87% | 2026-05-27 |
| EQ-Bench | Generalization | 3 | 84.79 | 2026-05-06 |
| FreshQA | Generalization | 1 | 46.4% | 2026-05-27 |
| HELM AIR-Bench | Generalization | 49 | 0.641728 | 2026-05-28 |
| InfiniteBench | Generalization | 1 | 46.099167% | 2026-05-27 |
| L-Eval | Generalization | 1 | 73.111667% | 2026-05-27 |
| MoralChoice | Generalization | 4 | 1 | 2026-05-27 |
| MT-Bench | Generalization | 1 | 8.990625 | 2026-05-27 |
| MT-Bench | Generalization | 22 | 5.4125 | 2026-05-27 |
| WildBench | Generalization | 11 | 7.6640625 | 2026-05-27 |
| AgentClinic | Healthcare | 2 | 51.6% | 2026-05-27 |
| MMLU Medical Genetics | Healthcare | 2 | 91.0% | 2026-05-27 |
| MMLU Professional Medicine | Healthcare | 2 | 93.01% | 2026-05-27 |
| MultiMedQA | Healthcare | 2 | 81.134167% | 2026-05-27 |
| HREF | Instruction Following | 26 | 6.12 | 2026-05-06 |
| RubricEval | Instruction Following | 1 | 3.18 | 2026-05-06 |
| URIAL Bench | Instruction Following | 1 | 8.99 | 2026-05-06 |
| AIR-Bench | Intelligence | 3 | 53.5889 | 2026-05-27 |
| Artificial Analysis Intelligence Index | Intelligence | 371 | 12.75 | 2026-05-11 |
| C-Eval | Intelligence | 67 | 68.7% | 2026-05-27 |
| ChartBench | Intelligence | 2 | 54.39 | 2026-05-06 |
| Gaokao-Bench | Intelligence | 1 | 72.2% | 2026-05-27 |
| Gaokao-Bench | Intelligence | 2 | 71.6% | 2026-05-27 |
| HELM Instruct | Intelligence | 3 | 0.611111 | 2026-05-28 |
| HELM Lite | Intelligence | 3 | 0.908908 | 2026-05-28 |
| MathVision | Intelligence | 124 | 23.98 | 2026-05-06 |
| MathVision | Intelligence | 128 | 22.76 | 2026-05-06 |
| MathVision | Intelligence | 152 | 13.10 | 2026-05-06 |
| MathVista | Intelligence | 31 | 58.10 | 2026-05-06 |
| MathVista | Intelligence | 38 | 49.90 | 2026-05-06 |
| MathVista | Intelligence | 61 | 33.90 | 2026-05-06 |
| MathVista | Intelligence | 64 | 33.20 | 2026-05-06 |
| MMBench-CN | Intelligence | 3 | 73.3 | 2026-05-27 |
| MMStar | Intelligence | 1 | 57.10 | 2026-05-06 |
| MMStar | Intelligence | 4 | 46.10 | 2026-05-06 |
| MVBench | Intelligence | 3 | 43.5 | 2026-05-27 |
| OCRBench | Intelligence | 12 | 645 | 2026-05-06 |
| SEED-Bench | Intelligence | 4 | 67.30 | 2026-05-06 |
| SEED-Bench-2 | Intelligence | 4 | 69.80 | 2026-05-06 |
| VCR | Intelligence | 2 | 81.6% | 2026-05-27 |
| Open Ko-LLM Leaderboard | Language | 296 | 40.27 | 2026-05-06 |
| Open Ko-LLM Leaderboard | Language | 344 | 39.38 | 2026-05-06 |
| LawBench | Legal | 2 | 53.8453 | 2026-05-27 |
| LawBench | Legal | 3 | 52.3521 | 2026-05-27 |
| JEEBench | Math | 1 | 0.389 | 2026-05-27 |
| JEEBench | Math | 2 | 0.350 | 2026-05-27 |
| JEEBench | Math | 3 | 0.339 | 2026-05-27 |
| JEEBench | Math | 4 | 0.309 | 2026-05-27 |
| LeanDojo Benchmark | Math | 3 | 7.4% | 2026-05-27 |
| OlympiadBench | Math | 2 | 17.97 | 2026-05-06 |
| OlympiadBench | Math | 2 | 29.93 | 2026-05-06 |
| OlympiadBench | Math | 3 | 29.07 | 2026-05-06 |
| Open Medical-LLM Leaderboard | Medical | 4 | 82.97 | 2026-05-06 |
| ReXrank | Medical | 115 | 0.708 | 2026-05-27 |
| ReXrank | Medical | 123 | 0.683 | 2026-05-27 |
| ReXrank | Medical | 136 | 0.629 | 2026-05-27 |
| ReXrank | Medical | 142 | 0.605 | 2026-05-27 |
| ReXrank | Medical | 148 | 0.568 | 2026-05-27 |
| ReXrank | Medical | 149 | 0.558 | 2026-05-27 |
| ReXrank | Medical | 152 | 0.549 | 2026-05-27 |
| ReXrank | Medical | 164 | 0.431 | 2026-05-27 |
| BenchBench | Meta | 27 | 0.76 | 2026-05-06 |
| AutoEval-Video | Multimodal | 1 | 22.20 | 2026-05-06 |
| MMAU | Multimodal | 21 | 51.03 | 2026-05-06 |
| ScienceQA | Multimodal | 8 | 92.53 | 2026-05-06 |
| ScienceQA | Multimodal | 26 | 86.54 | 2026-05-06 |
| ScienceQA | Multimodal | 33 | 83.99 | 2026-05-06 |
| Video-MME | Multimodal | 31 | 63.30 | 2026-05-06 |
| DROP | Reasoning | 10 | 0.81 | 2026-05-06 |
| YALL Nous Leaderboard | Reasoning | 144 | 45.66 | 2026-05-06 |
| ChatRAG Bench | Retrieval | 5 | 53.90 | 2026-05-06 |
| ChemBench | Science | 47 | 0.41 | 2026-05-06 |
| SWT-Bench | Software Engineering | 20 | 18.5% | 2026-05-27 |
| SWT-Bench | Software Engineering | 23 | 15.9% | 2026-05-27 |
| SWT-Bench | Software Engineering | 25 | 14.1% | 2026-05-27 |
| SWT-Bench | Software Engineering | 26 | 12.7% | 2026-05-27 |
| SWT-Bench | Software Engineering | 29 | 9.4% | 2026-05-27 |
| SWT-Bench | Software Engineering | 30 | 9.1% | 2026-05-27 |
| SWT-Bench | Software Engineering | 31 | 3.6% | 2026-05-27 |
| AudioMC | Speech | 11 | 14.82 | 2026-05-07 |
| AudioMC | Speech | 14 | 13.05 | 2026-05-07 |
| AudioMC - Audio Output | Speech | 5 | 13.05 | 2026-05-07 |
| AudioMC - Text Output | Speech | 9 | 14.82 | 2026-05-06 |
| VoiceBench | Speech | 7 | 82.84 | 2026-05-27 |
| SheetCopilot Benchmark | Spreadsheets | 5 | 65.0% | 2026-05-27 |
| VNTL Leaderboard | Translation | 12 | 69.28 | 2026-05-06 |
| CG-Bench | Video | 10 | 24.9% open-ended acc. / 32.6% MCQ long acc. | 2026-05-28 |