NYU LLM CTF

NYU LLM CTF evaluates autonomous agents on a 200-challenge capture-the-flag benchmark covering crypto, forensics, misc, pwn, rev, and web tasks.

Rows: 15
Primary metric: solved
Sampled: 2026-05-06

Metadata

Metrics

Solved, Crypto solved, Forensics solved, Misc solved, Pwn solved, Rev solved, Web solved

Latest Results

Snapshot mirrors the public NYU LLM CTF leaderboard for agent submissions on a 200-challenge CTF dataset. Display names preserve the source agent name and reported base model.

| Rank | Subject | Solved | Model Match | Provenance | Sampled |
|------|---------|--------|-------------|------------|---------|
| 1 | CRAKEN (Self-RAG + Graph-RAG) (claude-3.5-sonnet-20241022) | 44 | | Imported | 2026-05-06 |
| 2 | CRAKEN (Self-RAG) (claude-3.5-sonnet-20241022) | 42 | | Imported | 2026-05-06 |
| 3 | D-CIPHER (claude-3.5-sonnet-20241022) | 38 | | Imported | 2026-05-06 |
| 4 | CRAKEN (Self-RAG) (claude-3.7-sonnet-20250219) | 37 | | Imported | 2026-05-06 |
| 5 | D-CIPHER (claude-3.7-sonnet-20250219) | 35 | | Imported | 2026-05-06 |
| 6 | D-CIPHER (gpt-4.1) | 27 | | Imported | 2026-05-06 |
| 7 | EnIGMA (claude-3.5-sonnet-20240620) | 27 | | Imported | 2026-05-06 |
| 8 | CRAKEN (Self-RAG) (gpt-4.1) | 23 | | Imported | 2026-05-06 |
| 9 | CRAKEN (Self-RAG) (gpt-4o) | 23 | | Imported | 2026-05-06 |
| 10 | D-CIPHER (gpt-4o) | 21 | | Imported | 2026-05-06 |
| 11 | EnIGMA (gpt-4o) | 19 | | Imported | 2026-05-06 |
| 12 | EnIGMA (gpt-4-1106-preview) | 14 | | Imported | 2026-05-06 |
| 13 | NYU CTF Baseline (gpt-4-0125-preview) | 10 | | Imported | 2026-05-06 |
| 14 | NYU CTF Baseline (claude-3-haiku-20240307) | 8 | | Imported | 2026-05-06 |
| 15 | NYU CTF Baseline (gpt-3.5-turbo-1106) | 8 | | Imported | 2026-05-06 |
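
The Solved column is a raw count, so it can help to read it as a fraction of the dataset. The sketch below is illustrative only: it assumes each count is measured against the full 200-challenge set described above, and the names `TOTAL_CHALLENGES`, `RESULTS`, and `solve_rate` are hypothetical, not part of the leaderboard. Only three rows from the snapshot are copied in.

```python
# Assumption: every Solved count is out of the full 200-challenge dataset.
TOTAL_CHALLENGES = 200

# (subject, challenges solved), copied from three rows of the snapshot.
RESULTS = [
    ("CRAKEN (Self-RAG + Graph-RAG) (claude-3.5-sonnet-20241022)", 44),
    ("D-CIPHER (gpt-4.1)", 27),
    ("NYU CTF Baseline (gpt-3.5-turbo-1106)", 8),
]


def solve_rate(solved: int, total: int = TOTAL_CHALLENGES) -> float:
    """Return the percentage of challenges solved."""
    return 100.0 * solved / total


if __name__ == "__main__":
    for subject, solved in RESULTS:
        rate = solve_rate(solved)
        print(f"{subject}: {solved}/{TOTAL_CHALLENGES} = {rate:.1f}%")
```

Under that assumption, the top entry's 44 solved challenges correspond to a 22.0% solve rate, and the lowest listed count of 8 corresponds to 4.0%.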