Phi 4
Phi / Microsoft
54 scores
49 benchmarks
$0.065 / $0.14 per 1M tokens (cost in/out)
Metadata
Phi Closed/API
Aliases: microsoft-phi-4, microsoft/phi-4, phi-4
| Benchmark | Category | Rank | Score | Sampled |
|---|---|---|---|---|
| Berkeley Function-Calling Leaderboard | Agentic | 70 | 28.79% | 2026-05-27 |
| Tau2-Bench Telecom | Agentic | 400 | 0% | 2026-05-11 |
| Terminal-Bench Hard | Agentic | 292 | 3.8% | 2026-05-11 |
| OpenUGI | Alignment | 1071 | 20.39 | 2026-05-06 |
| Stick To Your Role! | Alignment | 27 | 0.32 | 2026-05-06 |
| TextClass Benchmark | Classification | 36 | 1615.70 | 2026-05-06 |
| BigCodeBench | Coding | 27 | 45.50 | 2026-05-06 |
| SciCode | Coding | 312 | 26% | 2026-05-11 |
| GSMA Open Telco Leaderboard | Domain | 59 | 44.41 | 2026-05-06 |
| AI Energy Score | Efficiency | 186 | 1 | 2026-05-06 |
| Vectara HHEM Hallucination Leaderboard | Factuality | 4 | 96.30 | 2026-05-06 |
| SECQUE | Finance | 5 | 0.56 | 2026-05-28 |
| ALL Bench LLM | General Knowledge | 36 | 7.30 | 2026-05-06 |
| BenchLM | General Knowledge | 92 | 28 | 2026-05-06 |
| Open LLM Leaderboard v2 | General Knowledge | 969 | 30.36 | 2026-05-06 |
| Open LLM Leaderboard v2 | General Knowledge | 1112 | 29.48 | 2026-05-06 |
| HealthBench Hard | Healthcare | 29 | 0.34 | 2026-05-27 |
| Artificial Analysis Intelligence Index | Intelligence | 418 | 10.41 | 2026-05-11 |
| Humanity's Last Exam | Intelligence | 408 | 4.1% | 2026-05-11 |
| MMLU-Pro | Intelligence | 210 | 71.4% | 2026-05-11 |
| MuSR | Intelligence | 65 | 23.79 | 2026-05-06 |
| MuSR | Intelligence | 70 | 23.72 | 2026-05-06 |
| ANLI | Language | 9 | 42.50 | 2026-05-06 |
| AraGen v3 | Language | 42 | 29.98 | 2026-05-06 |
| HellaSwag | Language | 17 | 53.60 | 2026-05-06 |
| Open Arabic LLM Leaderboard | Language | 123 | 45.69 | 2026-05-06 |
| Open Portuguese LLM Leaderboard | Language | 148 | 83.13 | 2026-05-06 |
| WinoGrande | Language | 18 | 73.40 | 2026-05-06 |
| LEXam | Legal | 27 | 38.54% open / 40.66% MCQ | 2026-05-28 |
| AIME 2025 | Math | 214 | 18% | 2026-05-11 |
| MATH Level 5 | Math | 768 | 31.65 | 2026-05-06 |
| MATH Level 5 | Math | 901 | 27.87 | 2026-05-06 |
| OTIS Mock AIME 2024-2025 | Mathematics | 28 | 13.75 | 2026-05-06 |
| PhiBench | Mathematics | 3 | 0.56 | 2026-05-06 |
| BRIDGE Medical Leaderboard | Medical | 37 | 46.8 | 2026-05-27 |
| BRIDGE Medical Leaderboard | Medical | 164 | 36.13 | 2026-05-27 |
| BRIDGE Medical Leaderboard | Medical | 205 | 32.59 | 2026-05-27 |
| MEDIC Benchmark | Medical | 64 | 60.11 average normalized public table score | 2026-05-27 |
| FLORES European Languages Leaderboard | Multilingual | 5 | 45.74 | 2026-05-06 |
| INCLUDE-base-44 European Languages | Multilingual | 9 | 0.59 | 2026-05-06 |
| LanguageBench | Multilingual | 22 | 0.45 | 2026-05-06 |
| ALL Bench Multimodal | Multimodal | 36 | 15.21 | 2026-05-06 |
| Artificial Analysis Openness Index | Openness | 57 | 50 | 2026-05-11 |
| Balrog | Reasoning | 16 | 11.60 | 2026-05-06 |
| BBH | Reasoning | 10 | 59.40 | 2026-05-06 |
| DROP | Reasoning | 18 | 0.76 | 2026-05-06 |
| GPQA Diamond | Reasoning | 298 | 57.5% | 2026-05-11 |
| LingOly-TOO | Reasoning | 14 | 0.11 | 2026-05-06 |
| Halluverse-M3 | Safety | 8 | 70.38% | 2026-05-28 |
| CritPt | Science | 336 | 0% | 2026-05-11 |
| Structured Output Benchmark | Structured Output | 25 | 83.10 | 2026-05-06 |
| VNTL Leaderboard | Translation | 22 | 68.60 | 2026-05-06 |
| K-MetBench | Weather | 42 | 51.5% accuracy | 2026-05-28 |
| Lech Mazur Writing | Writing | 27 | 6.26 | 2026-05-06 |