CTFBench
CTFBench: Measures model robustness, truthfulness, calibration, bias, harmfulness, jailbreak resistance, or alignment-relevant behavior.
27 rows
Primary metric: vulnerability_detection_rate
Sampled: 2026-05-28
Metadata
Metrics: Vulnerability Detection Rate, Overreporting Index (lower is better)
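This page does not define the two metrics, so the following is only a minimal sketch under stated assumptions: that Vulnerability Detection Rate is the fraction of known vulnerabilities a subject correctly reports, and that Overreporting Index is the number of false-positive findings per line of audited code. The `AuditResult` type and its field names are hypothetical and are not taken from the benchmark.

```python
from dataclasses import dataclass


@dataclass
class AuditResult:
    """Hypothetical per-subject tallies; field names are illustrative only."""
    true_positives: int          # known vulnerabilities correctly reported
    total_vulnerabilities: int   # known vulnerabilities in the test set
    false_positives: int         # reported findings that match no known vulnerability
    lines_of_code: int           # size of the audited code


def vulnerability_detection_rate(result: AuditResult) -> float:
    """Assumed definition: detected / total known vulnerabilities (higher is better)."""
    if result.total_vulnerabilities <= 0:
        raise ValueError("total_vulnerabilities must be positive")
    return result.true_positives / result.total_vulnerabilities


def overreporting_index(result: AuditResult) -> float:
    """Assumed definition: false positives per line of code (lower is better)."""
    if result.lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    return result.false_positives / result.lines_of_code


if __name__ == "__main__":
    # Made-up numbers, not data from the leaderboard.
    example = AuditResult(true_positives=20, total_vulnerabilities=21,
                          false_positives=3, lines_of_code=1000)
    print(f"VDR: {vulnerability_detection_rate(example):.3f}")
    print(f"OI:  {overreporting_index(example):.4f}")
```

Under these assumed definitions, a subject that reports every possible issue could reach a detection rate of 1.000 while scoring poorly on the Overreporting Index, which is why the two metrics are usually read together.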
| Rank | Subject | Vulnerability Detection Rate | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | SavantChat Dec 2025 | 1.000 | — | Imported | 2026-05-28 |
| 2 | gpt_5.5 | 1.000 | — | Imported | 2026-05-28 |
| 3 | gemini_3.1_pro | 0.984 | — | Imported | 2026-05-28 |
| 4 | SavantChat May 2025 | 0.952 | — | Imported | 2026-05-28 |
| 5 | claude_opus_4.6 | 0.889 | — | Imported | 2026-05-28 |
| 6 | SavantChat Mar 2025 | 0.857 | — | Imported | 2026-05-28 |
| 7 | kimi_k2.6 | 0.825 | — | Imported | 2026-05-28 |
| 8 | claude_opus_4.7 | 0.730 | — | Imported | 2026-05-28 |
| 9 | gpt_5 | 0.714 | — | Imported | 2026-05-28 |
| 10 | claude_opus_4.5 | 0.714 | — | Imported | 2026-05-28 |
| 11 | deepseek_v4_pro | 0.648 | — | Imported | 2026-05-28 |
| 12 | gemini_2.5_pro | 0.571 | — | Imported | 2026-05-28 |
| 13 | gpt_5.4 | 0.571 | — | Imported | 2026-05-28 |
| 14 | mimo_v2.5_pro | 0.556 | — | Imported | 2026-05-28 |
| 15 | grok 3 thinking | 0.524 | — | Imported | 2026-05-28 |
| 16 | ARMUR | 0.524 | — | Imported | 2026-05-28 |
| 17 | gpt_5.2 | 0.524 | — | Imported | 2026-05-28 |
| 18 | minimax_m2.7 | 0.508 | — | Imported | 2026-05-28 |
| 19 | openai_o3_mini_high | 0.429 | — | Imported | 2026-05-28 |
| 20 | openai_o3_mini | 0.429 | — | Imported | 2026-05-28 |
| 21 | deepseek_r1 | 0.429 | — | Imported | 2026-05-28 |
| 22 | Code Genie AI | 0.333 | — | Imported | 2026-05-28 |
| 23 | slither | 0.238 | — | Imported | 2026-05-28 |
| 24 | QuillShield | 0.143 | — | Imported | 2026-05-28 |
| 25 | Aegis | 0.143 | — | Imported | 2026-05-28 |
| 26 | AuditOne | 0.095 | — | Imported | 2026-05-28 |
| 27 | SCAU | 0.000 | — | Imported | 2026-05-28 |