IDE-Bench

IDE coding-agent benchmark measuring first-attempt success rate across 80 interactive software-engineering tasks.

15 rows
pass_at_1_accuracy (primary metric)
2026-05-27 (sampled)

Metadata

Metrics

Pass@1 Accuracy, 95% Confidence Interval (lower is better), Successful Trials, Total Trials, Average Iterations (lower is better)
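The metrics above can be related to one another with a small sketch. It assumes Pass@1 Accuracy is simply Successful Trials divided by Total Trials, expressed as a percentage. The leaderboard does not state how its 95% confidence interval is computed, so the Wilson score interval used here is an assumption chosen for illustration, and the function names `pass_at_1` and `wilson_interval` are hypothetical.

```python
import math


def pass_at_1(successes: int, total: int) -> float:
    """Percentage of trials solved on the first attempt."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * successes / total


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion, returned in percent.

    ASSUMPTION: the leaderboard's own interval method is not documented;
    Wilson is used here only as a common choice for small samples.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point drift at the extremes.
    return 100.0 * max(0.0, center - half), 100.0 * min(1.0, center + half)


# Hypothetical input: 70 successful trials out of 80 (one trial per task assumed).
accuracy = pass_at_1(70, 80)
low, high = wilson_interval(70, 80)
print(f"pass@1 = {accuracy:.2f}%, 95% CI = [{low:.1f}%, {high:.1f}%]")
```

With only 80 trials the interval is wide, so small gaps between adjacent entries on a leaderboard of this size should be read cautiously.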

Latest Results

Rows parsed from IDE-Bench's public Next.js leaderboard payload. Rankings are based on pass@1 accuracy across 80 IDE coding-agent tasks.

| Rank | Subject | Pass@1 Accuracy (%) | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude Sonnet 4.5 | 87.5 | Claude Sonnet 4.5 (anthropic-claude-sonnet-4.5) | Imported | 2026-05-27 |
| 2 | GPT 5.2 | 85 | GPT-5.2 (openai-gpt-5.2) | Imported | 2026-05-27 |
| 3 | Claude Opus 4.5 | 83.75 | Claude Opus 4.5 (anthropic-claude-opus-4.5) | Imported | 2026-05-27 |
| 4 | Claude Haiku 4.5 | 78.75 | Claude Haiku 4.5 (anthropic-claude-haiku-4.5) | Imported | 2026-05-27 |
| 5 | GPT 5.1 Codex Max | 73.75 | GPT-5.1-Codex-Max (openai-gpt-5.1-codex-max) | Imported | 2026-05-27 |
| 6 | Qwen 3 Max | 65 | Qwen3 Max (qwen-qwen3-max) | Imported | 2026-05-27 |
| 7 | Qwen 3 Coder | 57.5 | Qwen3 Coder 480B A35B (qwen-qwen3-coder) | Imported | 2026-05-27 |
| 8 | Gemini 3 Pro | 55 | Gemini 3 (google-gemini-3) | Imported | 2026-05-27 |
| 9 | Grok 4.1 Fast | 35 | Grok 4.1 Fast (x-ai-grok-4.1-fast) | Imported | 2026-05-27 |
| 10 | DeepSeek V3 | 31.25 | DeepSeek V3.2 (deepseek-deepseek-v3.2) | Imported | 2026-05-27 |
| 11 | DeepSeek R1 | 20 | R1 0528 (deepseek-deepseek-r1-0528) | Imported | 2026-05-27 |
| 12 | Grok Code Fast | 11.25 | Grok Code Fast 1 (x-ai-grok-code-fast-1) | Imported | 2026-05-27 |
| 13 | Llama 4 Maverick | 2.5 | Llama 4 Maverick (meta-llama-4-maverick) | Imported | 2026-05-27 |
| 13 | Llama 4 Scout | 2.5 | Llama 4 Scout (meta-llama-llama-4-scout) | Imported | 2026-05-27 |
| 15 | Command R+ | 0 | Command R (08-2024) (cohere-command-r-08-2024) | Imported | 2026-05-27 |
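
The rank column skips from 13 to 15 after the two tied Llama 4 entries, which is consistent with standard competition ranking (tied entries share a rank and the following rank is skipped). The leaderboard does not document its tie rule, so the sketch below is an assumption that merely reproduces the ranks shown; the function name `competition_rank` is hypothetical, and the scores are copied from the rows above.

```python
def competition_rank(scores: dict[str, float]) -> list[tuple[int, str, float]]:
    """Rank entries by descending score; ties share the earliest rank."""
    # sorted() is stable, so tied entries keep their input order.
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    ranked: list[tuple[int, str, float]] = []
    for position, (name, score) in enumerate(ordered, start=1):
        if ranked and ranked[-1][2] == score:
            rank = ranked[-1][0]  # same score as the previous entry: reuse its rank
        else:
            rank = position       # otherwise the rank equals the 1-based position
        ranked.append((rank, name, score))
    return ranked


# Pass@1 accuracy (%) per subject, as listed on the leaderboard.
SCORES = {
    "Claude Sonnet 4.5": 87.5,
    "GPT 5.2": 85,
    "Claude Opus 4.5": 83.75,
    "Claude Haiku 4.5": 78.75,
    "GPT 5.1 Codex Max": 73.75,
    "Qwen 3 Max": 65,
    "Qwen 3 Coder": 57.5,
    "Gemini 3 Pro": 55,
    "Grok 4.1 Fast": 35,
    "DeepSeek V3": 31.25,
    "DeepSeek R1": 20,
    "Grok Code Fast": 11.25,
    "Llama 4 Maverick": 2.5,
    "Llama 4 Scout": 2.5,
    "Command R+": 0,
}

for rank, name, score in competition_rank(SCORES):
    print(f"{rank:>2}  {name:<18} {score}")
```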