Aider Refactoring Benchmark

The Aider refactoring benchmark measures model performance on code refactoring tasks.

Rows: 14
Primary metric: percent_completed_correctly
Sampled: 2026-05-06

Metadata

Metrics

Percent completed correctly, Percent using correct edit format

Latest Results

Rows are ranked in source table order.

Rank  Subject                              Percent completed correctly  Model Match                                      Provenance  Sampled
1     claude-3-5-sonnet-20241022           92.10                        Claude 3.5 Sonnet (anthropic-claude-3.5-sonnet)  Imported    2026-05-06
2     o1-preview                           75.30                        o1-preview (openai-o1-preview)                   Imported    2026-05-06
3     claude-3-opus-20240229               72.30                        -                                                Imported    2026-05-06
4     claude-3.5-sonnet-20240620           64                           Claude 3.5 Sonnet (anthropic-claude-3.5-sonnet)  Imported    2026-05-06
5     gpt-4o                               62.90                        GPT-4o (openai-gpt-4o)                           Imported    2026-05-06
6     gpt-4-1106-preview                   50.60                        GPT-4 (openai-gpt-4)                             Imported    2026-05-06
7     gpt-4o-2024-08-06                    49.40                        GPT-4o (openai-gpt-4o)                           Imported    2026-05-06
8     gemini/gemini-1.5-pro-latest         49.40                        -                                                Imported    2026-05-06
9     o1-mini                              44.90                        -                                                Imported    2026-05-06
10    gpt-4-turbo-2024-04-09 (udiff)       34.10                        GPT-4 Turbo (openai-gpt-4-turbo)                 Imported    2026-05-06
11    gpt-4-0125-preview                   33.70                        GPT-4 (openai-gpt-4)                             Imported    2026-05-06
12    DeepSeek Coder V2 0724 (deprecated)  32.60                        -                                                Imported    2026-05-06
13    DeepSeek Chat V2.5                   31.50                        -                                                Imported    2026-05-06
14    gpt-4-turbo-2024-04-09 (diff)        21.40                        GPT-4 Turbo (openai-gpt-4-turbo)                 Imported    2026-05-06
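
The rows above are listed in source table order, not explicitly sorted by score. As a minimal sketch, the Python below copies the subject and "Percent completed correctly" values from the table and checks whether that order happens to be descending by score, and which subjects share a score. The names RESULTS, is_sorted_descending, and tied_subjects are hypothetical helpers for illustration; they are not part of Aider or of the source page.

```python
# (subject, percent completed correctly), copied in source table order.
RESULTS = [
    ("claude-3-5-sonnet-20241022", 92.10),
    ("o1-preview", 75.30),
    ("claude-3-opus-20240229", 72.30),
    ("claude-3.5-sonnet-20240620", 64.0),
    ("gpt-4o", 62.90),
    ("gpt-4-1106-preview", 50.60),
    ("gpt-4o-2024-08-06", 49.40),
    ("gemini/gemini-1.5-pro-latest", 49.40),
    ("o1-mini", 44.90),
    ("gpt-4-turbo-2024-04-09 (udiff)", 34.10),
    ("gpt-4-0125-preview", 33.70),
    ("DeepSeek Coder V2 0724 (deprecated)", 32.60),
    ("DeepSeek Chat V2.5", 31.50),
    ("gpt-4-turbo-2024-04-09 (diff)", 21.40),
]


def is_sorted_descending(rows):
    """Return True if each score is >= the score of the row after it."""
    scores = [score for _, score in rows]
    return all(a >= b for a, b in zip(scores, scores[1:]))


def tied_subjects(rows):
    """Map each score shared by more than one subject to those subjects."""
    by_score = {}
    for subject, score in rows:
        by_score.setdefault(score, []).append(subject)
    return {score: names for score, names in by_score.items() if len(names) > 1}


if __name__ == "__main__":
    print("rows:", len(RESULTS))
    print("descending by score:", is_sorted_descending(RESULTS))
    print("ties:", tied_subjects(RESULTS))
```

For this sample the source order coincides with descending score, and ranks 7 and 8 share the same score, so their relative order comes from the source table rather than from the metric.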