CursorBench 3.1

Cursor's coding-agent benchmark for ambiguous, multi-file tasks sourced from real Cursor sessions. CursorBench 3.1 adds codebase understanding, bug-finding, planning, and code review problems.

Rows: 18
Primary metric: cursorbench_score_percent
Sampled: 2026-05-28

Metadata

Metrics

CursorBench Score (higher is better), Avg Cost / Task (lower is better)

Latest Results

CursorBench 3.1 evaluates agents on ambiguous, multi-file tasks from real Cursor sessions. According to the public page, higher scores are better, and average cost per task is computed by applying published per-million-token pricing to the tokens used on each task.
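
The cost metric described above can be sketched as simple arithmetic. This is a minimal illustration, not Cursor's actual method: the split into input and output prices, the `Pricing` and `TaskUsage` types, and every number below are assumptions made up for the example. The page only says that per-million-token pricing is applied to the tokens used on each task.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical per-million-token prices in USD (not real model prices)."""
    input_per_million: float
    output_per_million: float


@dataclass(frozen=True)
class TaskUsage:
    """Hypothetical token counts consumed by an agent on one task."""
    input_tokens: int
    output_tokens: int


def task_cost(usage: TaskUsage, pricing: Pricing) -> float:
    """Cost of one task: tokens times price per million, divided by one million."""
    total = (
        usage.input_tokens * pricing.input_per_million
        + usage.output_tokens * pricing.output_per_million
    )
    return total / 1_000_000


def average_cost_per_task(usages: list[TaskUsage], pricing: Pricing) -> float:
    """Mean of the per-task costs; lower is better."""
    if not usages:
        raise ValueError("need at least one task to average over")
    return sum(task_cost(u, pricing) for u in usages) / len(usages)


# Illustrative values only: two tasks under an assumed price list.
pricing = Pricing(input_per_million=2.0, output_per_million=10.0)
usages = [
    TaskUsage(input_tokens=500_000, output_tokens=50_000),
    TaskUsage(input_tokens=1_000_000, output_tokens=100_000),
]
average = average_cost_per_task(usages, pricing)
print(f"average cost per task: ${average:.2f}")
```

With these assumed inputs the two tasks cost $1.50 and $3.00, so the average is $2.25. A real price list may distinguish further token classes (for example cached input), which this sketch does not model.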

Rank  Subject              CursorBench Score  Model Match  Provenance  Sampled
1     Opus 4.8 Max         64.8%                           Imported    2026-05-28
2     GPT-5.5 Extra High   64.3%                           Imported    2026-05-28
3     Composer 2.5         63.2%                           Imported    2026-05-28
4     GPT-5.5 High         62.6%                           Imported    2026-05-28
5     Opus 4.8 Extra High  61.6%                           Imported    2026-05-28
6     Opus 4.8 High        59.4%                           Imported    2026-05-28
7     GPT-5.5 Medium       59.2%                           Imported    2026-05-28
8     Opus 4.8 Medium      52.7%                           Imported    2026-05-28
9     Composer 2           52.2%                           Imported    2026-05-28
10    Gemini 3.5 Flash     49.8%                           Imported    2026-05-28
11    Sonnet 4.6 Max       49.0%                           Imported    2026-05-28
12    GPT-5.5 Low          48.8%                           Imported    2026-05-28
13    Sonnet 4.6 High      48.8%                           Imported    2026-05-28
14    Opus 4.8 Low         48.3%                           Imported    2026-05-28
15    Kimi 2.6             47.6%                           Imported    2026-05-28
16    Sonnet 4.6 Medium    46.0%                           Imported    2026-05-28
17    Sonnet 4.6 Low       41.5%                           Imported    2026-05-28
18    Kimi 2.5             31.9%                           Imported    2026-05-28