BountyBench

A cybersecurity benchmark that measures how well AI agents detect, exploit, and patch vulnerabilities in real-world bug bounty tasks. It reports success rates, bounty value, and token costs.

Rows: 10
Primary metric: overall_success_rate
Sampled: 2026-05-27

Metadata

Metrics

Overall Success Rate, Detect Success Rate, Detect Bounty Total, Exploit Success Rate, Patch Success Rate, Patch Bounty Total, Total Bounty Awarded

Latest Results

Rows parsed from BountyBench public CSV. The primary score is the mean of detect, exploit, and patch success rates.
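As a minimal sketch of how the primary score could be derived, the snippet below parses CSV text and averages the three per-task rates with equal weight. Only `overall_success_rate` is named on this page; the column names `subject`, `detect_success_rate`, `exploit_success_rate`, and `patch_success_rate`, the helper `load_scores`, and the sample rows are assumptions made for illustration and may not match the actual BountyBench CSV.

```python
import csv
import io

# Assumed column names; the real CSV header may differ.
RATE_COLUMNS = ("detect_success_rate", "exploit_success_rate", "patch_success_rate")


def overall_success_rate(row):
    """Return the unweighted mean of the detect, exploit, and patch rates."""
    rates = [float(row[name]) for name in RATE_COLUMNS]
    return sum(rates) / len(rates)


def load_scores(csv_text):
    """Map each subject in the CSV text to its overall success rate."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["subject"]: overall_success_rate(row) for row in reader}


# Hypothetical rows, not taken from the benchmark.
SAMPLE_CSV = """subject,detect_success_rate,exploit_success_rate,patch_success_rate
agent-a,10,20,30
agent-b,0,50,100
"""

if __name__ == "__main__":
    for subject, score in load_scores(SAMPLE_CSV).items():
        print(subject, score)
```

An unweighted mean treats the three task types as equally important, so a high rate on one task can offset a low rate on another.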

Rank  Subject                     Overall Success Rate (%)  Model Match  Provenance  Sampled
1     Claude Code                 50                        -            Imported    2026-05-27
2     OpenAI Codex CLI: o3-high   50                        -            Imported    2026-05-27
3     C-Agent: Claude 3.7         44.1667                   -            Imported    2026-05-27
4     OpenAI Codex CLI: o4-mini   42.5                      -            Imported    2026-05-27
5     C-Agent: GPT-4.1            35                        -            Imported    2026-05-27
6     C-Agent: DeepSeek-R1        30                        -            Imported    2026-05-27
7     C-Agent: Gemini 2.5         29.1667                   -            Imported    2026-05-27
8     C-Agent: Llama 4 Maverick   28.3333                   -            Imported    2026-05-27
9     C-Agent: o3-high            24.1667                   -            Imported    2026-05-27
10    C-Agent: Qwen3 235B A22B    14.1667                   -            Imported    2026-05-27
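The ranking can be reproduced with a descending sort on the overall success rate. The sketch below uses four rows from the table. How this page breaks ties is not stated; the helper `rank_rows` is hypothetical and assumes that tied subjects keep their input order and receive consecutive ranks, which is consistent with the two subjects at 50 being listed as ranks 1 and 2.

```python
def rank_rows(rows):
    """Rank (subject, score) pairs by descending score.

    sorted() is stable, so tied subjects keep their input order
    (an assumed tie-break rule, not one documented by the source).
    """
    ordered = sorted(rows, key=lambda pair: -pair[1])
    return [(rank, subject, score)
            for rank, (subject, score) in enumerate(ordered, start=1)]


# Four rows copied from the table, deliberately out of order.
ROWS = [
    ("C-Agent: GPT-4.1", 35.0),
    ("Claude Code", 50.0),
    ("OpenAI Codex CLI: o3-high", 50.0),
    ("C-Agent: Claude 3.7", 44.1667),
]

if __name__ == "__main__":
    for rank, subject, score in rank_rows(ROWS):
        print(rank, subject, score)
```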