PhiBench
PhiBench is an internal benchmark designed to evaluate diverse skills and reasoning abilities of language models. It covers a wide range of tasks, including coding (debugging, extending incomplete code, explaining code snippets) and mathematics (identifying proof errors, generating related problems). Microsoft's research team created it to address limitations of standard academic benchmarks and to guide the development of the Phi-4 model.
3 rows
Primary metric: score
Sampled: 2026-05-06
Metadata
Metrics
Score, Normalized Score
| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Phi 4 Reasoning Plus | 0.74 | — | Self-reported | 2026-05-06 |
| 2 | Phi 4 Reasoning | 0.71 | — | Self-reported | 2026-05-06 |
| 3 | Phi 4 | 0.56 | Phi 4 microsoft-phi-4 | Self-reported | 2026-05-06 |