APPS
APPS: Measures model capability on programming tasks, specifically generating Python code from natural-language problem statements and checking it against test cases.
Rows: 4
Primary metric: test_case_average
Sampled: 2026-05-27
Metadata
Metrics
Test case average, Introductory test case average, Interview test case average, Competitive test case average, Strict accuracy, Introductory strict accuracy, Interview strict accuracy, Competition strict accuracy
| Rank | Subject | Test case average (%) | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT-Neo 2.7B | 10.15 | — | Imported | 2026-05-27 |
| 2 | GPT-2 1.5B | 7.96 | — | Imported | 2026-05-27 |
| 3 | GPT-2 0.1B | 6.16 | — | Imported | 2026-05-27 |
| 4 | GPT-3 175B | 0.55 | — | Imported | 2026-05-27 |