CursorBench 3.1
Cursor's coding-agent benchmark for ambiguous, multi-file tasks sourced from real Cursor sessions. CursorBench 3.1 adds codebase-understanding, bug-finding, planning, and code-review problems.
Rows: 18
Primary metric: cursorbench_score_percent
Sampled: 2026-05-28
Metrics: CursorBench Score; Avg Cost / Task (lower is better)
| Rank | Subject | CursorBench Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Opus 4.8 Max | 64.8% | — | Imported | 2026-05-28 |
| 2 | GPT-5.5 Extra High | 64.3% | — | Imported | 2026-05-28 |
| 3 | Composer 2.5 | 63.2% | — | Imported | 2026-05-28 |
| 4 | GPT-5.5 High | 62.6% | — | Imported | 2026-05-28 |
| 5 | Opus 4.8 Extra High | 61.6% | — | Imported | 2026-05-28 |
| 6 | Opus 4.8 High | 59.4% | — | Imported | 2026-05-28 |
| 7 | GPT-5.5 Medium | 59.2% | — | Imported | 2026-05-28 |
| 8 | Opus 4.8 Medium | 52.7% | — | Imported | 2026-05-28 |
| 9 | Composer 2 | 52.2% | — | Imported | 2026-05-28 |
| 10 | Gemini 3.5 Flash | 49.8% | — | Imported | 2026-05-28 |
| 11 | Sonnet 4.6 Max | 49.0% | — | Imported | 2026-05-28 |
| 12 | GPT-5.5 Low | 48.8% | — | Imported | 2026-05-28 |
| 13 | Sonnet 4.6 High | 48.8% | — | Imported | 2026-05-28 |
| 14 | Opus 4.8 Low | 48.3% | — | Imported | 2026-05-28 |
| 15 | Kimi 2.6 | 47.6% | — | Imported | 2026-05-28 |
| 16 | Sonnet 4.6 Medium | 46.0% | — | Imported | 2026-05-28 |
| 17 | Sonnet 4.6 Low | 41.5% | — | Imported | 2026-05-28 |
| 18 | Kimi 2.5 | 31.9% | — | Imported | 2026-05-28 |