AIRTBench

AI red teaming benchmark evaluating language models' ability to autonomously discover and exploit AI/ML security vulnerabilities across 70 security challenges.

12 rows
Primary metric: success_rate
Sampled: 2026-05-06

Metadata

Metrics

Success Rate

Latest Results

Rows are parsed from the model success-rate table on the public AIRTBench dataset card. The full experimental-results CSV/Parquet file, which includes conversations, is too large for this importer pass. Source model display names are preserved.
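The parsing step described above might look roughly like the following. This is a minimal sketch, not the importer's actual code: it assumes the dataset card exposes the table as a two-column pipe-delimited table (model, success rate), and the names `Row`, `parse_success_table`, and `SAMPLE` are hypothetical. The sample values are taken from the results listed on this page.

```python
from typing import NamedTuple


class Row(NamedTuple):
    rank: int
    subject: str          # model name exactly as it appears in the source
    success_rate: float


def parse_success_table(markdown: str) -> list[Row]:
    """Parse a two-column pipe table (model | success rate) into ranked rows.

    Assumption: the table layout is hypothetical; the real dataset card
    may use different columns or formatting.
    """
    parsed = []
    for line in markdown.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 2:
            continue  # blank line or unexpected column count
        try:
            rate = float(cells[1].rstrip("%"))
        except ValueError:
            continue  # header or separator row
        parsed.append((cells[0], rate))
    # Rank by success rate, highest first; names are kept unchanged.
    parsed.sort(key=lambda item: item[1], reverse=True)
    return [Row(i, name, rate) for i, (name, rate) in enumerate(parsed, start=1)]


SAMPLE = """
| Model | Success Rate |
|---|---|
| openai/gpt-4o | 20.29 |
| claude-3-7-sonnet-20250219 | 46.86 |
| groq/llama-3.3-70b-versatile | 0 |
"""

rows = parse_success_table(SAMPLE)
for row in rows:
    print(row.rank, row.subject, row.success_rate)
```

Keeping the model name as an opaque string reflects the note that source display names are preserved; any mapping to a canonical model identifier would be a separate step.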

| Rank | Subject | Success Rate (%) | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | claude-3-7-sonnet-20250219 | 46.86 | Claude 3.7 Sonnet (anthropic-claude-3.7-sonnet) | Imported | 2026-05-06 |
| 2 | gpt-4.5-preview | 36.89 | GPT-4.5 (openai-gpt-4.5-preview) | Imported | 2026-05-06 |
| 3 | gemini/gemini-2.5-pro-preview-05-06 | 34.29 | | Imported | 2026-05-06 |
| 4 | openai/o3-mini | 28.43 | o3-mini (openai-o3-mini) | Imported | 2026-05-06 |
| 5 | together_ai/deepseek-ai/DeepSeek-R1 | 26.86 | | Imported | 2026-05-06 |
| 6 | gemini/gemini-2.5-flash-preview-04-17 | 26.43 | | Imported | 2026-05-06 |
| 7 | openai/gpt-4o | 20.29 | GPT-4o (openai-gpt-4o) | Imported | 2026-05-06 |
| 8 | gemini/gemini-2.0-flash | 16.86 | | Imported | 2026-05-06 |
| 9 | gemini/gemini-1.5-pro | 15.14 | | Imported | 2026-05-06 |
| 10 | groq/meta-llama/llama-4-scout-17b-16e-instruct | 1 | | Imported | 2026-05-06 |
| 11 | groq/qwen-qwq-32b | 0.57 | | Imported | 2026-05-06 |
| 12 | groq/llama-3.3-70b-versatile | 0 | | Imported | 2026-05-06 |