AIRTBench
AI red teaming benchmark evaluating language models' ability to autonomously discover and exploit AI/ML security vulnerabilities across 70 security challenges.
Rows: 12
Primary metric: success_rate
Sampled: 2026-05-06
Metrics
Success Rate
| Rank | Subject | Success Rate (%) | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | claude-3-7-sonnet-20250219 | 46.86 | Claude 3.7 Sonnet (`anthropic-claude-3.7-sonnet`) | Imported | 2026-05-06 |
| 2 | gpt-4.5-preview | 36.89 | GPT-4.5 (`openai-gpt-4.5-preview`) | Imported | 2026-05-06 |
| 3 | gemini/gemini-2.5-pro-preview-05-06 | 34.29 | — | Imported | 2026-05-06 |
| 4 | openai/o3-mini | 28.43 | o3-mini (`openai-o3-mini`) | Imported | 2026-05-06 |
| 5 | together_ai/deepseek-ai/DeepSeek-R1 | 26.86 | — | Imported | 2026-05-06 |
| 6 | gemini/gemini-2.5-flash-preview-04-17 | 26.43 | — | Imported | 2026-05-06 |
| 7 | openai/gpt-4o | 20.29 | GPT-4o (`openai-gpt-4o`) | Imported | 2026-05-06 |
| 8 | gemini/gemini-2.0-flash | 16.86 | — | Imported | 2026-05-06 |
| 9 | gemini/gemini-1.5-pro | 15.14 | — | Imported | 2026-05-06 |
| 10 | groq/meta-llama/llama-4-scout-17b-16e-instruct | 1 | — | Imported | 2026-05-06 |
| 11 | groq/qwen-qwq-32b | 0.57 | — | Imported | 2026-05-06 |
| 12 | groq/llama-3.3-70b-versatile | 0 | — | Imported | 2026-05-06 |