CTFBench

CTFBench: Measures model robustness, truthfulness, calibration, bias, harmfulness, jailbreak resistance, or alignment-relevant behavior.

Rows: 27
Primary metric: vulnerability_detection_rate
Sampled: 2026-05-28

Metadata

Metrics

Vulnerability Detection Rate, Overreporting Index (lower is better)
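This page names the two metrics but does not define them. The following is a minimal sketch under stated assumptions: that Vulnerability Detection Rate is the fraction of known vulnerabilities an auditor reports, and that Overreporting Index is the number of false positives per line of audited code. The function names, parameters, and normalisation are illustrative and not taken from this page.

```python
def vulnerability_detection_rate(detected: int, total_known: int) -> float:
    """Assumed definition: share of known vulnerabilities that were reported."""
    if total_known <= 0:
        raise ValueError("total_known must be positive")
    if not 0 <= detected <= total_known:
        raise ValueError("detected must be between 0 and total_known")
    return detected / total_known


def overreporting_index(false_positives: int, lines_of_code: int) -> float:
    """Assumed definition: false positives per line of code (lower is better)."""
    if lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    if false_positives < 0:
        raise ValueError("false_positives must be non-negative")
    return false_positives / lines_of_code


# Hypothetical inputs, chosen only to illustrate the arithmetic.
vdr = vulnerability_detection_rate(detected=60, total_known=63)
oi = overreporting_index(false_positives=5, lines_of_code=1000)
print(f"VDR={vdr:.3f} OI={oi:.4f}")
```

Under these assumptions a higher Vulnerability Detection Rate and a lower Overreporting Index are both better, which matches the "lower is better" note above.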

Latest Results

Rows are imported from the public CTFBench static HTML benchmark results table for AI smart contract auditors.

| Rank | Subject | Vulnerability Detection Rate | Model Match | Provenance | Sampled |
|------|---------|------------------------------|-------------|------------|---------|
| 1 | SavantChat Dec 2025 | 1.000 | | Imported | 2026-05-28 |
| 2 | gpt_5.5 | 1.000 | | Imported | 2026-05-28 |
| 3 | gemini_3.1_pro | 0.984 | | Imported | 2026-05-28 |
| 4 | SavantChat May 2025 | 0.952 | | Imported | 2026-05-28 |
| 5 | claude_opus_4.6 | 0.889 | | Imported | 2026-05-28 |
| 6 | SavantChat Mar 2025 | 0.857 | | Imported | 2026-05-28 |
| 7 | kimi_k2.6 | 0.825 | | Imported | 2026-05-28 |
| 8 | claude_opus_4.7 | 0.730 | | Imported | 2026-05-28 |
| 9 | gpt_5 | 0.714 | | Imported | 2026-05-28 |
| 10 | claude_opus_4.5 | 0.714 | | Imported | 2026-05-28 |
| 11 | deepseek_v4_pro | 0.648 | | Imported | 2026-05-28 |
| 12 | gemini_2.5_pro | 0.571 | | Imported | 2026-05-28 |
| 13 | gpt_5.4 | 0.571 | | Imported | 2026-05-28 |
| 14 | mimo_v2.5_pro | 0.556 | | Imported | 2026-05-28 |
| 15 | grok 3 thinking | 0.524 | | Imported | 2026-05-28 |
| 16 | ARMUR | 0.524 | | Imported | 2026-05-28 |
| 17 | gpt_5.2 | 0.524 | | Imported | 2026-05-28 |
| 18 | minimax_m2.7 | 0.508 | | Imported | 2026-05-28 |
| 19 | openai_o3_mini_high | 0.429 | | Imported | 2026-05-28 |
| 20 | openai_o3_mini | 0.429 | | Imported | 2026-05-28 |
| 21 | deepseek_r1 | 0.429 | | Imported | 2026-05-28 |
| 22 | Code Genie AI | 0.333 | | Imported | 2026-05-28 |
| 23 | slither | 0.238 | | Imported | 2026-05-28 |
| 24 | QuillShield | 0.143 | | Imported | 2026-05-28 |
| 25 | Aegis | 0.143 | | Imported | 2026-05-28 |
| 26 | AuditOne | 0.095 | | Imported | 2026-05-28 |
| 27 | SCAU | 0.000 | | Imported | 2026-05-28 |
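The import step mentioned above the table (reading rows from a static HTML results table) might look roughly like the sketch below. It uses only Python's standard-library `html.parser`. The markup in `SAMPLE_HTML`, the two-column layout, and the names `TableRows` and `parse_results` are assumptions for illustration; the real CTFBench page structure is not shown here and may differ.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical markup; the two data rows reuse values from the table above.
SAMPLE_HTML = """
<table>
  <tr><th>Subject</th><th>VDR</th></tr>
  <tr><td>slither</td><td>0.238</td></tr>
  <tr><td>SavantChat May 2025</td><td>0.952</td></tr>
</table>
"""


def parse_results(html: str) -> list:
    """Return records sorted by detection rate, highest first."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    _header, *body = parser.rows  # first row is assumed to be the header
    records = [{"subject": s, "vdr": float(v)} for s, v in body]
    records.sort(key=lambda r: r["vdr"], reverse=True)
    return records


records = parse_results(SAMPLE_HTML)
for rank, rec in enumerate(records, start=1):
    print(rank, rec["subject"], f"{rec['vdr']:.3f}")
```

Because Python's sort is stable, subjects with equal rates keep their source order, which is one way tied entries could still end up with consecutive ranks as in the table above.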