APPS

APPS measures a model's ability to generate code from natural-language problem descriptions. Generated programs are scored by running them against each problem's test cases.

Rows: 4
Primary metric: test_case_average
Sampled: 2026-05-27

Metadata

Metrics

Test case average, Introductory test case average, Interview test case average, Competition test case average, Strict accuracy, Introductory strict accuracy, Interview strict accuracy, Competition strict accuracy
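The two metric families differ in how partial success is counted. As a minimal sketch, assuming test case average is the mean over problems of the fraction of test cases passed, and strict accuracy is the share of problems where every test case passes, both can be computed from per-problem pass/fail lists. The function names and the toy data below are hypothetical and are not taken from the benchmark's own evaluation code. The per-difficulty variants (Introductory, Interview, Competition) would apply the same functions to the matching subset of problems.

```python
from typing import List


def compute_test_case_average(results: List[List[bool]]) -> float:
    """Mean over problems of the fraction of test cases passed, in percent."""
    per_problem = [sum(r) / len(r) for r in results]
    return 100.0 * sum(per_problem) / len(per_problem)


def compute_strict_accuracy(results: List[List[bool]]) -> float:
    """Percentage of problems whose solution passes every test case."""
    return 100.0 * sum(all(r) for r in results) / len(results)


# Hypothetical outcomes for three problems (one inner list per problem,
# one boolean per test case). Illustrative only, not benchmark data.
toy_results = [
    [True, True, True, True],     # all test cases pass
    [True, False, False, False],  # one of four passes
    [False, False],               # nothing passes
]

tca = compute_test_case_average(toy_results)
sa = compute_strict_accuracy(toy_results)
print(f"test case average: {tca:.2f}")
print(f"strict accuracy:   {sa:.2f}")
```

In this toy data the second problem raises the test case average but contributes nothing to strict accuracy, which is why the strict figure is the lower of the two.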

Latest Results

Rows are parsed from the main results table in the LaTeX source of the APPS paper on arXiv.

Rank  Subject       Test case average (%)  Model Match  Provenance  Sampled
1     GPT-Neo 2.7B  10.15                  -            Imported    2026-05-27
2     GPT-2 1.5B    7.96                   -            Imported    2026-05-27
3     GPT-2 0.1B    6.16                   -            Imported    2026-05-27
4     GPT-3 175B    0.55                   -            Imported    2026-05-27
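The import step that produces these rows can be sketched roughly as follows. This is an illustration under stated assumptions, not the site's actual parser: the LaTeX row layout (model name, then score, separated by `&` and ended by `\\`) and the helper names `parse_rows` and `rank_rows` are hypothetical. Only the model names and scores come from the table above.

```python
import re
from typing import List, Tuple

# Hypothetical LaTeX tabular rows. The values match the table above,
# but the column layout is assumed for illustration.
LATEX_ROWS = r"""
GPT-2 0.1B & 6.16 \\
GPT-2 1.5B & 7.96 \\
GPT-Neo 2.7B & 10.15 \\
GPT-3 175B & 0.55 \\
"""


def parse_rows(latex: str) -> List[Tuple[str, float]]:
    """Extract (model, score) pairs from simple `name & score \\\\` rows."""
    rows = []
    for line in latex.splitlines():
        line = line.strip()
        if not line or "&" not in line:
            continue  # skip blank lines and anything that is not a data row
        line = re.sub(r"\\\\\s*$", "", line)  # drop the trailing row terminator
        cells = [cell.strip() for cell in line.split("&")]
        rows.append((cells[0], float(cells[-1])))
    return rows


def rank_rows(rows: List[Tuple[str, float]]) -> List[Tuple[int, str, float]]:
    """Sort by score, highest first, and attach 1-based ranks."""
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    return [(i, name, score) for i, (name, score) in enumerate(ordered, start=1)]


ranked = rank_rows(parse_rows(LATEX_ROWS))
for position, name, score in ranked:
    print(position, name, score)
```

Sorting on the primary metric alone reproduces the ordering shown in the table, with GPT-Neo 2.7B first and GPT-3 175B last.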