vexp-swe-bench

Open coding-agent benchmark harness comparing agent resolution rate, cost, and unique wins on a curated 100-task subset of SWE-bench Verified.

Rows: 4
Primary metric: pass_at_1
Sampled: 2026-05-06

Metadata

Metrics

Pass@1, Cost per Task (lower is better), Unique Wins Lower Bound
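As a rough illustration of the primary metric, the sketch below computes Pass@1 as the share of tasks resolved on a single attempt. It assumes the score is reported as a percentage of the task subset; the function name, the task IDs, and the per-task outcomes are hypothetical and are not taken from the harness or its results.

```python
def pass_at_1(resolved_by_task: dict[str, bool]) -> float:
    """Percentage of tasks resolved on the first (and only) attempt."""
    if not resolved_by_task:
        raise ValueError("no task results supplied")
    resolved = sum(resolved_by_task.values())
    return 100.0 * resolved / len(resolved_by_task)


# Hypothetical outcomes for a 100-task subset: the first 73 tasks resolved.
example_results = {f"task-{i:03d}": i < 73 for i in range(100)}
example_score = pass_at_1(example_results)
```

On a 100-task subset the percentage and the raw count of resolved tasks are the same number, which is one plausible reading of the whole-number scores in the results table.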

Latest Results

Rows ranked by highest Pass@1, with cost per task retained as a secondary metric.

Rank  Subject             Pass@1  Model Match  Provenance  Sampled
1     vexp + Claude Code  73                   Imported    2026-05-06
2     Live-SWE-Agent      72                   Imported    2026-05-06
3     OpenHands           70                   Imported    2026-05-06
4     Sonar Foundation    70                   Imported    2026-05-06
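The ranking rule described above (highest Pass@1 first) can be sketched in a few lines. This is a minimal illustration, not the harness's own code: the `Row` type and `rank` function are hypothetical names, and the handling of ties (keeping input order) is an assumption, since the page does not say how the two rows at 70 are ordered. Cost per task is omitted because no cost values appear in the table.

```python
from typing import NamedTuple


class Row(NamedTuple):
    subject: str
    pass_at_1: int  # score as listed in the results table


def rank(rows: list[Row]) -> list[tuple[int, str, int]]:
    """Order rows by Pass@1, highest first, and attach 1-based ranks."""
    # sorted() is stable, so rows with equal scores keep their input order
    # (an assumed tie rule; the source does not specify one).
    ordered = sorted(rows, key=lambda r: r.pass_at_1, reverse=True)
    return [(i, r.subject, r.pass_at_1) for i, r in enumerate(ordered, start=1)]


# Scores copied from the table, deliberately given out of order.
rows = [
    Row("OpenHands", 70),
    Row("vexp + Claude Code", 73),
    Row("Sonar Foundation", 70),
    Row("Live-SWE-Agent", 72),
]

ranked = rank(rows)
for position, subject, score in ranked:
    print(position, subject, score)
```

If cost per task were available, it could be added as a second sort key to break ties, but that would be a design choice beyond what the page states.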