Phi 4

Phi / Microsoft

54 scores
49 benchmarks
$0.065 / $0.14 per 1M tokens (cost in/out)
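As a rough illustration of how the listed prices translate into a per-request cost, the sketch below applies the two per-million-token rates to a token count. It assumes that "in/out" means input and output tokens and that billing is strictly linear with no minimums or discounts; the function name `estimate_cost_usd` and the constants are hypothetical and not part of any provider API.

```python
# Listed prices in USD per 1M tokens (taken from the line above).
INPUT_USD_PER_MILLION = 0.065   # assumed: price for input (prompt) tokens
OUTPUT_USD_PER_MILLION = 0.14   # assumed: price for output (completion) tokens


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of a request, assuming simple linear per-token billing."""
    total = (input_tokens * INPUT_USD_PER_MILLION
             + output_tokens * OUTPUT_USD_PER_MILLION)
    return total / 1_000_000


if __name__ == "__main__":
    # Example: 2M input tokens and 500K output tokens.
    print(f"{estimate_cost_usd(2_000_000, 500_000):.4f}")  # prints 0.2000
```

Actual provider pricing may differ by host, tier, or date, so the figures should be checked against the current price list before budgeting.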

Metadata

Phi Closed/API

Aliases: microsoft-phi-4, microsoft/phi-4, phi-4

Benchmark Results

Benchmark | Category | Rank | Score | Sampled
Berkeley Function-Calling Leaderboard | Agentic | 70 | 28.79% | 2026-05-27
Tau2-Bench Telecom | Agentic | 400 | 0% | 2026-05-11
Terminal-Bench Hard | Agentic | 292 | 3.8% | 2026-05-11
OpenUGI | Alignment | 1071 | 20.39 | 2026-05-06
Stick To Your Role! | Alignment | 27 | 0.32 | 2026-05-06
TextClass Benchmark | Classification | 36 | 1615.70 | 2026-05-06
BigCodeBench | Coding | 27 | 45.50 | 2026-05-06
SciCode | Coding | 312 | 26% | 2026-05-11
GSMA Open Telco Leaderboard | Domain | 59 | 44.41 | 2026-05-06
AI Energy Score | Efficiency | 186 | 1 | 2026-05-06
Vectara HHEM Hallucination Leaderboard | Factuality | 4 | 96.30 | 2026-05-06
SECQUE | Finance | 5 | 0.56 | 2026-05-28
ALL Bench LLM | General Knowledge | 36 | 7.30 | 2026-05-06
BenchLM | General Knowledge | 92 | 28 | 2026-05-06
Open LLM Leaderboard v2 | General Knowledge | 969 | 30.36 | 2026-05-06
Open LLM Leaderboard v2 | General Knowledge | 1112 | 29.48 | 2026-05-06
HealthBench Hard | Healthcare | 29 | 0.34 | 2026-05-27
Artificial Analysis Intelligence Index | Intelligence | 418 | 10.41 | 2026-05-11
Humanity's Last Exam | Intelligence | 408 | 4.1% | 2026-05-11
MMLU-Pro | Intelligence | 210 | 71.4% | 2026-05-11
MuSR | Intelligence | 65 | 23.79 | 2026-05-06
MuSR | Intelligence | 70 | 23.72 | 2026-05-06
ANLI | Language | 9 | 42.50 | 2026-05-06
AraGen v3 | Language | 42 | 29.98 | 2026-05-06
HellaSwag | Language | 17 | 53.60 | 2026-05-06
Open Arabic LLM Leaderboard | Language | 123 | 45.69 | 2026-05-06
Open Portuguese LLM Leaderboard | Language | 148 | 83.13 | 2026-05-06
WinoGrande | Language | 18 | 73.40 | 2026-05-06
LEXam | Legal | 27 | 38.54% open / 40.66% MCQ | 2026-05-28
AIME 2025 | Math | 214 | 18% | 2026-05-11
MATH Level 5 | Math | 768 | 31.65 | 2026-05-06
MATH Level 5 | Math | 901 | 27.87 | 2026-05-06
OTIS Mock AIME 2024-2025 | Mathematics | 28 | 13.75 | 2026-05-06
PhiBench | Mathematics | 3 | 0.56 | 2026-05-06
BRIDGE Medical Leaderboard | Medical | 37 | 46.8 | 2026-05-27
BRIDGE Medical Leaderboard | Medical | 164 | 36.13 | 2026-05-27
BRIDGE Medical Leaderboard | Medical | 205 | 32.59 | 2026-05-27
MEDIC Benchmark | Medical | 64 | 60.11 average normalized public table score | 2026-05-27
FLORES European Languages Leaderboard | Multilingual | 5 | 45.74 | 2026-05-06
INCLUDE-base-44 European Languages | Multilingual | 9 | 0.59 | 2026-05-06
LanguageBench | Multilingual | 22 | 0.45 | 2026-05-06
ALL Bench Multimodal | Multimodal | 36 | 15.21 | 2026-05-06
Artificial Analysis Openness Index | Openness | 57 | 50 | 2026-05-11
Balrog | Reasoning | 16 | 11.60 | 2026-05-06
BBH | Reasoning | 10 | 59.40 | 2026-05-06
DROP | Reasoning | 18 | 0.76 | 2026-05-06
GPQA Diamond | Reasoning | 298 | 57.5% | 2026-05-11
LingOly-TOO | Reasoning | 14 | 0.11 | 2026-05-06
Halluverse-M3 | Safety | 8 | 70.38% | 2026-05-28
CritPt | Science | 336 | 0% | 2026-05-11
Structured Output Benchmark | Structured Output | 25 | 83.10 | 2026-05-06
VNTL Leaderboard | Translation | 22 | 68.60 | 2026-05-06
K-MetBench | Weather | 42 | 51.5% accuracy | 2026-05-28
Lech Mazur Writing | Writing | 27 | 6.26 | 2026-05-06