Long Code Arena
Benchmark suite for long-context software engineering tasks including library-based code generation, CI build repair, commit message generation, bug localization, and module summarization.
Rows: 9
Primary metric: mean_score
Sampled: 2026-05-06
Metadata
Metrics
Mean Score, Mean Rank (lower is better), Mean Rank Std (lower is better), Library-based Code Generation, CI Builds Repair, Commit Message Generation, Bug Localization, Module Summarization
| Rank | Subject | Mean Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT-o1 | 0.96 | — | Imported | 2026-05-06 |
| 2 | Claude 3.5 Sonnet | 0.84 | Claude 3.5 Sonnet (`anthropic-claude-3.5-sonnet`) | Imported | 2026-05-06 |
| 3 | DeepSeek R1 | 0.80 | R1 (`deepseek-r1`) | Imported | 2026-05-06 |
| 4 | GPT-4o | 0.70 | GPT-4o (`openai-gpt-4o`) | Imported | 2026-05-06 |
| 5 | Gemini 1.5 Pro | 0.58 | — | Imported | 2026-05-06 |
| 6 | Llama 3.1 (405B) | 0.47 | — | Imported | 2026-05-06 |
| 7 | Claude 3 Haiku | 0.42 | Claude 3 Haiku (`anthropic-claude-3-haiku`) | Imported | 2026-05-06 |
| 8 | Llama 3.1 (70B) | 0.29 | — | Imported | 2026-05-06 |
| 9 | Llama 3.1 (8B) | 0.12 | — | Imported | 2026-05-06 |
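
The Rank column above appears to follow descending Mean Score. As a minimal sketch, assuming that is the only ordering rule (the page does not say how ties would be broken), the ranking can be reproduced from the table values. The names `ROWS` and `rank_by_mean_score` are hypothetical and used only for illustration; they are not part of the benchmark's tooling.

```python
# Rows copied from the leaderboard table: (subject, mean_score).
ROWS = [
    ("Claude 3 Haiku", 0.42),
    ("GPT-4o", 0.70),
    ("Llama 3.1 (8B)", 0.12),
    ("GPT-o1", 0.96),
    ("Llama 3.1 (405B)", 0.47),
    ("DeepSeek R1", 0.80),
    ("Gemini 1.5 Pro", 0.58),
    ("Claude 3.5 Sonnet", 0.84),
    ("Llama 3.1 (70B)", 0.29),
]


def rank_by_mean_score(rows):
    """Return (rank, subject, score) tuples, highest mean score first.

    Assumption: rank is determined by mean score alone; ties are not
    given any special handling here.
    """
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    return [(i, name, score) for i, (name, score) in enumerate(ordered, start=1)]


ranking = rank_by_mean_score(ROWS)
for rank, name, score in ranking:
    print(f"{rank}. {name}: {score:.2f}")
```

Sorting the shuffled input recovers the same order as the published Rank column, which is consistent with mean_score being the primary metric.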