NYU LLM CTF
NYU LLM CTF evaluates autonomous agents on a 200-challenge capture-the-flag benchmark covering crypto, forensics, misc, pwn, rev, and web tasks.
Rows: 15
Primary metric: Solved
Sampled: 2026-05-06
Metadata
Metrics: Solved, Crypto solved, Forensics solved, Misc solved, Pwn solved, Rev solved, Web solved
| Rank | Subject | Solved | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | CRAKEN (Self-RAG + Graph-RAG) (claude-3.5-sonnet-20241022) | 44 | — | Imported | 2026-05-06 |
| 2 | CRAKEN (Self-RAG) (claude-3.5-sonnet-20241022) | 42 | — | Imported | 2026-05-06 |
| 3 | D-CIPHER (claude-3.5-sonnet-20241022) | 38 | — | Imported | 2026-05-06 |
| 4 | CRAKEN (Self-RAG) (claude-3.7-sonnet-20250219) | 37 | — | Imported | 2026-05-06 |
| 5 | D-CIPHER (claude-3.7-sonnet-20250219) | 35 | — | Imported | 2026-05-06 |
| 6 | D-CIPHER (gpt-4.1) | 27 | — | Imported | 2026-05-06 |
| 7 | EnIGMA (claude-3.5-sonnet-20240620) | 27 | — | Imported | 2026-05-06 |
| 8 | CRAKEN (Self-RAG) (gpt-4.1) | 23 | — | Imported | 2026-05-06 |
| 9 | CRAKEN (Self-RAG) (gpt-4o) | 23 | — | Imported | 2026-05-06 |
| 10 | D-CIPHER (gpt-4o) | 21 | — | Imported | 2026-05-06 |
| 11 | EnIGMA (gpt-4o) | 19 | — | Imported | 2026-05-06 |
| 12 | EnIGMA (gpt-4-1106-preview) | 14 | — | Imported | 2026-05-06 |
| 13 | NYU CTF Baseline (gpt-4-0125-preview) | 10 | — | Imported | 2026-05-06 |
| 14 | NYU CTF Baseline (claude-3-haiku-20240307) | 8 | — | Imported | 2026-05-06 |
| 15 | NYU CTF Baseline (gpt-3.5-turbo-1106) | 8 | — | Imported | 2026-05-06 |
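
The Solved counts are easier to compare as a share of the 200-challenge benchmark. The following is a minimal sketch, assuming every subject was scored against all 200 challenges (the table does not state this). The function name `solve_rate` and the `sample` dictionary are illustrative and are not part of the leaderboard.

```python
TOTAL_CHALLENGES = 200  # benchmark size stated in the description

def solve_rate(solved: int, total: int = TOTAL_CHALLENGES) -> float:
    """Return the share of challenges solved, as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= solved <= total:
        raise ValueError("solved must be between 0 and total")
    return 100.0 * solved / total

# A few rows copied from the table (subject -> Solved).
sample = {
    "CRAKEN (Self-RAG + Graph-RAG) (claude-3.5-sonnet-20241022)": 44,
    "D-CIPHER (claude-3.5-sonnet-20241022)": 38,
    "NYU CTF Baseline (gpt-4-0125-preview)": 10,
}

for subject, solved in sample.items():
    print(f"{subject}: {solved}/{TOTAL_CHALLENGES} = {solve_rate(solved):.1f}%")
```

Under that assumption, the top entry's 44 solved corresponds to 22.0% of the benchmark, and the strongest baseline's 10 solved corresponds to 5.0%.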