PaperBench

End-to-end replication of state-of-the-art AI papers, graded against hierarchical rubrics.

1rows
replication_scoreprimary metric
2025-04-02sampled

Metadata

Metrics

Replication Score

Latest Results

OpenAI-reported best-performing tested agent. The score reflects a scaffolded agent setup, not a base-model-only run.

Rank Subject Replication Score Model Match Provenance Sampled
1 Claude 3.5 Sonnet (New) + open-source scaffolding 21.0% Claude 3.5 Sonnet
anthropic-claude-3.5-sonnet
Imported 2025-04-02