BountyBench
Cybersecurity benchmark that measures how well AI agents detect, exploit, and patch vulnerabilities in real-world bug bounty tasks, reporting success rates, bounty value, and token costs.
Rows: 10
Primary metric: overall_success_rate
Sampled: 2026-05-27
Metadata
Metrics: Overall Success Rate, Detect Success Rate, Detect Bounty Total, Exploit Success Rate, Patch Success Rate, Patch Bounty Total, Total Bounty Awarded
| Rank | Subject | Overall Success Rate | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude Code | 50 | — | Imported | 2026-05-27 |
| 2 | OpenAI Codex CLI: o3-high | 50 | — | Imported | 2026-05-27 |
| 3 | C-Agent: Claude 3.7 | 44.1667 | — | Imported | 2026-05-27 |
| 4 | OpenAI Codex CLI: o4-mini | 42.5 | — | Imported | 2026-05-27 |
| 5 | C-Agent: GPT-4.1 | 35 | — | Imported | 2026-05-27 |
| 6 | C-Agent: DeepSeek-R1 | 30 | — | Imported | 2026-05-27 |
| 7 | C-Agent: Gemini 2.5 | 29.1667 | — | Imported | 2026-05-27 |
| 8 | C-Agent: Llama 4 Maverick | 28.3333 | — | Imported | 2026-05-27 |
| 9 | C-Agent: o3-high | 24.1667 | — | Imported | 2026-05-27 |
| 10 | C-Agent: Qwen3 235B A22B | 14.1667 | — | Imported | 2026-05-27 |
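
The ordering in the table above can be reproduced with a short script. What follows is a minimal sketch, not the leaderboard's own tooling: the `ROWS` list is copied by hand from the table, and `rank_rows` is a hypothetical helper. It also assumes that tied scores should share a rank, whereas the table itself assigns ranks 1 and 2 to the two subjects that both score 50.

```python
# Rows copied by hand from the leaderboard table: (subject, overall success rate).
ROWS = [
    ("Claude Code", 50.0),
    ("OpenAI Codex CLI: o3-high", 50.0),
    ("C-Agent: Claude 3.7", 44.1667),
    ("OpenAI Codex CLI: o4-mini", 42.5),
    ("C-Agent: GPT-4.1", 35.0),
    ("C-Agent: DeepSeek-R1", 30.0),
    ("C-Agent: Gemini 2.5", 29.1667),
    ("C-Agent: Llama 4 Maverick", 28.3333),
    ("C-Agent: o3-high", 24.1667),
    ("C-Agent: Qwen3 235B A22B", 14.1667),
]


def rank_rows(rows):
    """Hypothetical helper: sort by score, highest first, sharing ranks on ties.

    Returns a list of (rank, subject, score) tuples. Tie handling is an
    assumption of this sketch; the source table numbers rows sequentially.
    """
    # sorted() is stable, so tied rows keep their original relative order.
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    ranked = []
    for index, (subject, score) in enumerate(ordered):
        if index > 0 and score == ordered[index - 1][1]:
            rank = ranked[-1][0]  # same score as the previous row: reuse its rank
        else:
            rank = index + 1
        ranked.append((rank, subject, score))
    return ranked


if __name__ == "__main__":
    for rank, subject, score in rank_rows(ROWS):
        print(f"{rank:>2}  {subject:<28} {score}")
```

Under this tie rule, Claude Code and OpenAI Codex CLI: o3-high both receive rank 1, and C-Agent: Claude 3.7 keeps rank 3.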