FrontierSWE

Software-engineering agent benchmark targeting frontier-level implementation, performance optimization, and research tasks.

Rows: 13
Primary metric: avg_rank
Sampled: 2026-05-28

Metadata

Metrics

Average Rank (lower is better), Dominance, Implementation Average Rank (lower is better), Performance Average Rank (lower is better), Research Average Rank (lower is better)

Showing 2 latest source slices.

Latest Results

Provider-published system-card benchmark scores parsed from Anthropic's Claude Opus 4.8 capability evaluation tables. Rows are marked self-reported and should be interpreted as source claims unless independently reproduced.

| Rank | Subject | Average Rank | Model Match | Provenance | Sampled |
|------|---------|--------------|-------------|------------|---------|
| 1 | Claude Opus 4.8 | 2.7 avg rank | Claude Opus 4.8 (anthropic-claude-opus-4.8) | Self-reported | 2026-05-28 |
| 2 | Claude Opus 4.7 | 4.2 avg rank | Claude Opus 4.7 (anthropic-claude-opus-4.7) | Self-reported | 2026-05-28 |
| 3 | Claude Opus 4.6 | 4.9 avg rank | Claude Opus 4.6 (anthropic-claude-opus-4.6) | Self-reported | 2026-05-28 |

| Rank | Subject | Average Rank | Model Match | Provenance | Sampled |
|------|---------|--------------|-------------|------------|---------|
| 1 | GPT-5.5 (Codex) | 2.53 avg rank / 83% dominance | - | Imported | 2026-05-28 |
| 2 | Claude Opus 4.7 (Claude Code) | 3.56 avg rank / 72% dominance | - | Imported | 2026-05-28 |
| 3 | Claude Opus 4.6 (Claude Code) | 4.18 avg rank / 65% dominance | - | Imported | 2026-05-28 |
| 4 | GPT-5.4 (Codex) | 4.29 avg rank / 63% dominance | - | Imported | 2026-05-28 |
| 5 | Composer 2.5 (Cursor CLI) | 5.71 avg rank / 48% dominance | - | Imported | 2026-05-28 |
| 6 | Gemini 3.1 Pro (Gemini CLI) | 5.79 avg rank / 47% dominance | - | Imported | 2026-05-28 |
| 7 | DeepSeek V4 Pro (Claude Code) | 6.76 avg rank / 36% dominance | - | Imported | 2026-05-28 |
| 8 | Kimi K2.6 (Kimi CLI) | 7.12 avg rank / 32% dominance | - | Imported | 2026-05-28 |
| 9 | Kimi K2.5 (Kimi CLI) | 7.41 avg rank / 29% dominance | - | Imported | 2026-05-28 |
| 10 | Qwen3.6-Plus (Qwen Code) | 7.65 avg rank / 26% dominance | - | Imported | 2026-05-28 |
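
The page lists Dominance as a metric but does not define it. The imported rows look numerically consistent with reading dominance as the average share of the other systems that a subject outranks, that is, 100 * (N - avg_rank) / (N - 1) with N = 10 listed systems. The sketch below parses the metric strings from those rows and checks that reading. The formula and the helper names (`parse_metrics`, `implied_dominance`, `check_rows`) are assumptions made for illustration, not the benchmark's published definition.

```python
import re

# Metric strings copied from the ten "Imported" rows of the results table.
IMPORTED_ROWS = [
    ("GPT-5.5 (Codex)", "2.53 avg rank / 83% dominance"),
    ("Claude Opus 4.7 (Claude Code)", "3.56 avg rank / 72% dominance"),
    ("Claude Opus 4.6 (Claude Code)", "4.18 avg rank / 65% dominance"),
    ("GPT-5.4 (Codex)", "4.29 avg rank / 63% dominance"),
    ("Composer 2.5 (Cursor CLI)", "5.71 avg rank / 48% dominance"),
    ("Gemini 3.1 Pro (Gemini CLI)", "5.79 avg rank / 47% dominance"),
    ("DeepSeek V4 Pro (Claude Code)", "6.76 avg rank / 36% dominance"),
    ("Kimi K2.6 (Kimi CLI)", "7.12 avg rank / 32% dominance"),
    ("Kimi K2.5 (Kimi CLI)", "7.41 avg rank / 29% dominance"),
    ("Qwen3.6-Plus (Qwen Code)", "7.65 avg rank / 26% dominance"),
]

# Matches "2.7 avg rank" as well as "2.53 avg rank / 83% dominance".
METRIC_PATTERN = re.compile(
    r"(?P<rank>\d+(?:\.\d+)?) avg rank(?: / (?P<dom>\d+)% dominance)?"
)


def parse_metrics(text):
    """Return (avg_rank, dominance_percent or None) for one metric cell."""
    match = METRIC_PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognized metric string: {text!r}")
    avg_rank = float(match["rank"])
    dominance = int(match["dom"]) if match["dom"] is not None else None
    return avg_rank, dominance


def implied_dominance(avg_rank, n_systems):
    """ASSUMED relationship: percent of the other systems outranked on average."""
    return 100.0 * (n_systems - avg_rank) / (n_systems - 1)


def check_rows(rows):
    """Pair each listed dominance with the value implied by its average rank."""
    n_systems = len(rows)
    results = []
    for subject, text in rows:
        avg_rank, listed = parse_metrics(text)
        results.append((subject, listed, round(implied_dominance(avg_rank, n_systems))))
    return results


if __name__ == "__main__":
    for subject, listed, implied in check_rows(IMPORTED_ROWS):
        print(f"{subject}: listed {listed}%, implied {implied}%")
```

Under this assumption the rounded implied values agree with the listed percentages for all ten imported rows. The self-reported rows carry no dominance figure, so the same check cannot be run on them, and agreement on one slice does not confirm how the benchmark actually computes the metric.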