SWE-PRBench
Pull-request review benchmark with a public paper-baseline leaderboard for model review quality and false-positive behavior.
Rows: 8
Primary metric: overall_sbar
Sampled: 2026-05-27
Metrics
Overall (sbar), DR_A, False Positive Rate (lower is better)
| Rank | Subject | Overall (sbar) | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude Haiku 4.5 | 0.153 | Claude Haiku 4.5 (`anthropic-claude-haiku-4.5`) | Imported | 2026-05-27 |
| 2 | Claude Sonnet 4.6 | 0.152 | Claude Sonnet 4.6 (`anthropic-claude-sonnet-4.6`) | Imported | 2026-05-27 |
| 3 | DeepSeek V3 | 0.15 | DeepSeek V3 (`deepseek-deepseek-chat`) | Imported | 2026-05-27 |
| 4 | Mistral Large 3 | 0.147 | — | Imported | 2026-05-27 |
| 5 | GPT-4o | 0.113 | GPT-4o (`openai-gpt-4o`) | Imported | 2026-05-27 |
| 6 | GPT-4o-mini | 0.108 | GPT-4o-mini (`openai-gpt-4o-mini`) | Imported | 2026-05-27 |
| 7 | Mistral Small | 0.106 | — | Imported | 2026-05-27 |
| 8 | Llama 3.3 70B | 0.079 | Llama 3.3 70B Instruct (`meta-llama-llama-3.3-70b-instruct`) | Imported | 2026-05-27 |