PhiBench

PhiBench is an internal benchmark designed to evaluate diverse skills and reasoning abilities of language models. It covers a wide range of tasks, including coding (debugging, extending incomplete code, explaining code snippets) and mathematics (identifying proof errors, generating related problems). Microsoft's research team created it to address limitations of standard academic benchmarks and to guide the development of the Phi-4 model.

Rows: 3
Primary metric: score
Sampled: 2026-05-06

Metadata

Metrics

Score, Normalized Score

Latest Results

Rank  Subject               Score  Model Match              Provenance     Sampled
1     Phi 4 Reasoning Plus  0.74   -                        Self-reported  2026-05-06
2     Phi 4 Reasoning       0.71   -                        Self-reported  2026-05-06
3     Phi 4                 0.56   Phi 4 (microsoft-phi-4)  Self-reported  2026-05-06