Long Code Arena

Benchmark suite for long-context software engineering tasks including library-based code generation, CI build repair, commit message generation, bug localization, and module summarization.

Rows: 9
Primary metric: mean_score
Sampled: 2026-05-06

Metrics

Mean Score, Mean Rank (lower is better), Mean Rank Std (lower is better), Library-based Code Generation, CI Builds Repair, Commit Message Generation, Bug Localization, Module Summarization

Latest Results

Rows are ranked by aggregated Mean Score, highest first.

Rank  Subject            Mean Score  Model Match                                      Provenance  Sampled
1     GPT-o1             0.96        -                                                Imported    2026-05-06
2     Claude 3.5 Sonnet  0.84        Claude 3.5 Sonnet (anthropic-claude-3.5-sonnet)  Imported    2026-05-06
3     DeepSeek R1        0.80        R1 (deepseek-r1)                                 Imported    2026-05-06
4     GPT-4o             0.70        GPT-4o (openai-gpt-4o)                           Imported    2026-05-06
5     Gemini 1.5 Pro     0.58        -                                                Imported    2026-05-06
6     Llama 3.1 (405B)   0.47        -                                                Imported    2026-05-06
7     Claude 3 Haiku     0.42        Claude 3 Haiku (anthropic-claude-3-haiku)        Imported    2026-05-06
8     Llama 3.1 (70B)    0.29        -                                                Imported    2026-05-06
9     Llama 3.1 (8B)     0.12        -                                                Imported    2026-05-06
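
The ranking rule used for the Latest Results rows (highest Mean Score first) can be sketched in a few lines of Python. This is only an illustration, not the suite's own tooling: the tuple layout and the function name `rank_by_mean_score` are assumptions made here, and the scores are copied from the rows above.

```python
# (subject, mean_score) pairs copied from the Latest Results rows,
# deliberately listed out of order. The tuple layout is an assumption.
rows = [
    ("Llama 3.1 (8B)", 0.12),
    ("GPT-4o", 0.70),
    ("Claude 3 Haiku", 0.42),
    ("GPT-o1", 0.96),
    ("Llama 3.1 (70B)", 0.29),
    ("DeepSeek R1", 0.80),
    ("Gemini 1.5 Pro", 0.58),
    ("Claude 3.5 Sonnet", 0.84),
    ("Llama 3.1 (405B)", 0.47),
]


def rank_by_mean_score(rows):
    """Return (rank, subject, score) tuples, highest score first.

    Hypothetical helper: ranks start at 1, and ties (none occur in
    this data) keep their input order because sorted() is stable.
    """
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    return [(rank, subject, score)
            for rank, (subject, score) in enumerate(ordered, start=1)]


ranked = rank_by_mean_score(rows)
for rank, subject, score in ranked:
    print(f"{rank} {subject} {score:.2f}")
```

Sorting on the score alone is enough here because every Mean Score is distinct; a real leaderboard would likely need an explicit tie-break, for example on Mean Rank.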