Agent Security League

A security benchmark for AI coding agents that measures both functional correctness and security correctness across 200 real-world tasks spanning 77 CWE classes.

Rows: 17
Primary metric: secure_score
Sampled: 2026-05-06

Metadata

Metrics: Secure, Functional

Latest Results

Rows ranked by highest Secure percentage.

Rank  Subject                          Secure  Model Match  Provenance  Sampled
1     Cursor + GPT-5.5                  23.50               Imported    2026-05-06
2     Cursor + Claude Opus 4.7          22.90               Imported    2026-05-06
3     Claude Code + Claude Opus 4.7     20.10               Imported    2026-05-06
4     Codex + GPT-5.5                   20.10               Imported    2026-05-06
5     Codex + GPT-5.4                   17.30               Imported    2026-05-06
6     Cursor + Gemini 3.1 Pro           13.40               Imported    2026-05-06
7     Cursor + GPT-5.3                  12.80               Imported    2026-05-06
8     Cursor + Claude Opus 4.6           7.80               Imported    2026-05-06
9     Cursor + Gemini 3 Pro              7.30               Imported    2026-05-06
10    Claude Code + Claude Opus 4.5     10.10               Imported    2026-05-06
11    Claude Code + Claude Opus 4.6      8.40               Imported    2026-05-06
12    Claude Code + Gemini 3 Pro         8.40               Imported    2026-05-06
13    Claude Code + Claude Sonnet 4.6    7.80               Imported    2026-05-06
14    Claude Code + Claude Sonnet 4      6.10               Imported    2026-05-06
15    Claude Code + Gemini 2.5 Pro       5.00               Imported    2026-05-06
16    SWE-Agent + Claude Sonnet 4        7.80               Imported    2026-05-06
17    SWE-Agent + Gemini 2.5 Pro         4.50               Imported    2026-05-06
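
The ranking rule stated above (highest Secure percentage first) can be sketched in a few lines of Python. This is an illustrative sketch only, not the leaderboard's actual implementation: the `Row` type and `rank_by_secure` function are hypothetical names, and breaking ties by input order is an assumption, since the page does not state a tie-breaking rule. The sample uses five rows copied from the table.

```python
from typing import NamedTuple


class Row(NamedTuple):
    subject: str   # agent + model pairing, as listed in the table
    secure: float  # Secure percentage


def rank_by_secure(rows: list[Row]) -> list[tuple[int, Row]]:
    """Return (rank, row) pairs ordered by Secure percentage, highest first.

    sorted() is stable, so rows with equal scores keep their input order
    (an assumed tie-breaking rule; the leaderboard does not specify one).
    """
    ordered = sorted(rows, key=lambda r: r.secure, reverse=True)
    return list(enumerate(ordered, start=1))


# Five rows copied from the table, deliberately shuffled.
sample = [
    Row("Codex + GPT-5.4", 17.30),
    Row("Cursor + GPT-5.5", 23.50),
    Row("Claude Code + Claude Opus 4.7", 20.10),
    Row("Cursor + Claude Opus 4.7", 22.90),
    Row("Codex + GPT-5.5", 20.10),
]

ranked = rank_by_secure(sample)
for rank, row in ranked:
    print(f"{rank:<4} {row.subject:<31} {row.secure:>6.2f}")
```

Because the sort is stable, the two rows tied at 20.10 come out in the order they were supplied, which is why the tie-breaking assumption matters when reproducing a published ranking.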