SWE-PRBench

Pull-request review benchmark with a public paper-baseline leaderboard for model review quality and false-positive behavior.

8 rows
Primary metric: overall_sbar
Sampled: 2026-05-27

Metadata

Metrics

Overall (sbar), DR_A, False Positive Rate (lower is better)

Latest Results

Rows parsed from the SWE-PRBench public Hugging Face dataset card leaderboard. Evaluation notes on the source specify evals/eval_100.json, GPT-5.2 judge, and pipeline v0.4.1.

Rank  Subject            Overall (sbar)  Model Match                                                 Provenance  Sampled
1     Claude Haiku 4.5   0.153           Claude Haiku 4.5 (anthropic-claude-haiku-4.5)               Imported    2026-05-27
2     Claude Sonnet 4.6  0.152           Claude Sonnet 4.6 (anthropic-claude-sonnet-4.6)             Imported    2026-05-27
3     DeepSeek V3        0.15            DeepSeek V3 (deepseek-deepseek-chat)                        Imported    2026-05-27
4     Mistral Large 3    0.147           -                                                           Imported    2026-05-27
5     GPT-4o             0.113           GPT-4o (openai-gpt-4o)                                      Imported    2026-05-27
6     GPT-4o-mini        0.108           GPT-4o-mini (openai-gpt-4o-mini)                            Imported    2026-05-27
7     Mistral Small      0.106           -                                                           Imported    2026-05-27
8     Llama 3.3 70B      0.079           Llama 3.3 70B Instruct (meta-llama-llama-3.3-70b-instruct)  Imported    2026-05-27
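
The results note says the rows were parsed from the leaderboard on the dataset card. The actual import pipeline is not shown here, so the following is only a minimal sketch of how such a step might look, assuming the card presents the leaderboard as a single pipe-delimited Markdown table. The column names `Model` and `Overall (sbar)`, the `SAMPLE_CARD` text, and both helper functions are hypothetical. The three sample scores are copied from the rows above, and sorting in descending order assumes that a higher sbar is better, which matches the ranks shown.

```python
def split_cells(line: str) -> list[str]:
    """Split one pipe-delimited table line into trimmed cell strings."""
    return [cell.strip() for cell in line.strip("|").split("|")]


def parse_markdown_table(text: str) -> list[dict[str, str]]:
    """Parse a pipe-delimited Markdown table into dicts keyed by header.

    Assumes the text contains a single table whose second line is the
    |---|---| separator.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip().startswith("|")]
    if len(lines) < 2:
        return []
    header = split_cells(lines[0])
    rows = []
    for line in lines[2:]:  # lines[1] is the separator row
        cells = split_cells(line)
        if len(cells) == len(header):  # skip malformed rows
            rows.append(dict(zip(header, cells)))
    return rows


def rank_rows(rows, metric="Overall (sbar)", name="Model"):
    """Return (rank, name, score) tuples, highest score first (assumed direction)."""
    ordered = sorted(rows, key=lambda r: float(r[metric]), reverse=True)
    return [(i, r[name], float(r[metric])) for i, r in enumerate(ordered, start=1)]


# Hypothetical card excerpt; layout and column names are assumptions.
SAMPLE_CARD = """
| Model | Overall (sbar) |
|---|---|
| GPT-4o | 0.113 |
| Claude Haiku 4.5 | 0.153 |
| Llama 3.3 70B | 0.079 |
"""

ranked = rank_rows(parse_markdown_table(SAMPLE_CARD))
for rank, model, score in ranked:
    print(rank, model, score)
```

A metric such as False Positive Rate, where lower is better, would need the sort direction reversed before ranks are assigned.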