HELM AIR-Bench

HELM AIR-Bench: Measures model robustness, truthfulness, calibration, bias, harmfulness, jailbreak resistance, or alignment-relevant behavior.

Rows: 87
Primary metric: refusal_rate
Sampled: 2026-05-28

Metadata

Metrics

Refusal Rate, Security Risks, Operational Misuses, Violence & Extremism, Hate/Toxicity, Sexual Content, Child Harm, Self-harm, Political Usage, Economic Harm, Deception, Manipulation, Defamation, Fundamental Rights, Discrimination/Bias, Privacy, Criminal Activities, Observed inference time (s) (lower is better), # eval

Latest Results

Rows are imported from the HELM public GCS AIR-Bench 2024 group JSON. Refusal Rate is the primary safety score and is reported as a fraction between 0 and 1 (for example, 0.931507 corresponds to 93.15%).
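
The Refusal Rate values in the rows below appear as fractions between 0 and 1, so showing them as percentages takes one multiplication. The following is a minimal sketch of that conversion and of the descending ranking used in the table. The `Row` type and the `to_percent` and `rank` helpers are hypothetical names introduced for illustration, not part of HELM or of the import pipeline; the three sample scores are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """Hypothetical container for one leaderboard row."""
    subject: str
    refusal_rate: float  # fraction in [0, 1], as listed in the table


def to_percent(rate: float, digits: int = 2) -> str:
    """Format a fractional refusal rate as a percentage string."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError(f"expected a fraction in [0, 1], got {rate}")
    return f"{rate * 100:.{digits}f}%"


def rank(rows: list[Row]) -> list[tuple[int, str, str]]:
    """Sort rows by refusal rate, highest first, and number them from 1."""
    ordered = sorted(rows, key=lambda r: r.refusal_rate, reverse=True)
    return [
        (position, r.subject, to_percent(r.refusal_rate))
        for position, r in enumerate(ordered, start=1)
    ]


# Sample scores taken from the table; input order is deliberately shuffled.
rows = [
    Row("DBRX Instruct", 0.253512),
    Row("Claude 4.5 Haiku (20251001)", 0.931507),
    Row("GPT-5 (2025-08-07)", 0.876712),
]

for position, subject, pct in rank(rows):
    print(position, subject, pct)
```

Rejecting values outside the 0 to 1 range is a design choice in this sketch: it catches a score that has already been scaled to a percentage before it is multiplied a second time.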

| Rank | Subject | Refusal Rate | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude 4.5 Haiku (20251001) | 0.931507 | | Imported | 2026-05-28 |
| 2 | Claude 3.5 Sonnet (20241022) | 0.908325 | Claude 3.5 Sonnet (anthropic-claude-3.5-sonnet) | Imported | 2026-05-28 |
| 3 | Claude 4.5 Sonnet (20250929) | 0.898402 | | Imported | 2026-05-28 |
| 4 | Claude 4 Sonnet (20250514) | 0.882684 | | Imported | 2026-05-28 |
| 5 | gpt-oss-120b | 0.880049 | gpt-oss-120b (openai-gpt-oss-120b) | Imported | 2026-05-28 |
| 6 | GPT-5 nano (2025-08-07) | 0.878205 | GPT-5 Nano (openai-gpt-5-nano) | Imported | 2026-05-28 |
| 7 | GPT-5 (2025-08-07) | 0.876712 | GPT-5 (openai-gpt-5) | Imported | 2026-05-28 |
| 8 | Qwen3-Next 80B A3B Thinking | 0.866965 | Qwen3 Next 80B A3B Thinking (qwen-qwen3-next-80b-a3b-thinking) | Imported | 2026-05-28 |
| 9 | GPT-5.1 (2025-11-13) | 0.861872 | GPT-5.1 (openai-gpt-5.1) | Imported | 2026-05-28 |
| 10 | gpt-oss-20b | 0.859677 | gpt-oss-20b (openai-gpt-oss-20b) | Imported | 2026-05-28 |
| 11 | Claude 3.5 Sonnet (20240620) | 0.858974 | Claude 3.5 Sonnet (anthropic-claude-3.5-sonnet) | Imported | 2026-05-28 |
| 12 | Claude 4 Opus (20250514) | 0.857394 | | Imported | 2026-05-28 |
| 13 | GPT-5 mini (2025-08-07) | 0.857130 | GPT-5 Mini (openai-gpt-5-mini) | Imported | 2026-05-28 |
| 14 | Claude 3 Sonnet (20240229) | 0.846944 | | Imported | 2026-05-28 |
| 15 | o3 (2025-04-16) | 0.844661 | o3 (openai-o3) | Imported | 2026-05-28 |
| 16 | Claude 3 Opus (20240229) | 0.843695 | | Imported | 2026-05-28 |
| 17 | Gemini 1.5 Pro (001, BLOCK_NONE safety) | 0.828328 | | Imported | 2026-05-28 |
| 18 | Claude 3 Haiku (20240307) | 0.827011 | Claude 3 Haiku (anthropic-claude-3-haiku) | Imported | 2026-05-28 |
| 19 | IBM Granite 3.3 8B Instruct (with guardian) | 0.825167 | | Imported | 2026-05-28 |
| 20 | IBM Granite 4.0 Small (with guardian) | 0.820513 | | Imported | 2026-05-28 |
| 21 | Claude 3.7 Sonnet (20250219) | 0.817703 | Claude 3.7 Sonnet (anthropic-claude-3.7-sonnet) | Imported | 2026-05-28 |
| 22 | IBM Granite 4.0 Micro (with guardian) | 0.803916 | | Imported | 2026-05-28 |
| 23 | o1 (2024-12-17) | 0.799614 | o1 (openai-o1) | Imported | 2026-05-28 |
| 24 | Gemini 1.5 Flash (001, BLOCK_NONE safety) | 0.794169 | | Imported | 2026-05-28 |
| 25 | Qwen3 235B A22B Instruct 2507 FP8 | 0.789691 | | Imported | 2026-05-28 |
| 26 | o4-mini (2025-04-16) | 0.784861 | o4 Mini (openai-o4-mini) | Imported | 2026-05-28 |
| 27 | Palmyra X5 | 0.781700 | Palmyra X5 (writer-palmyra-x5) | Imported | 2026-05-28 |
| 28 | IBM Granite 3.3 8B Instruct | 0.760450 | | Imported | 2026-05-28 |
| 29 | o3-mini (2025-01-31) | 0.748858 | o3-mini (openai-o3-mini) | Imported | 2026-05-28 |
| 30 | GPT-4.5 (2025-02-27 preview) | 0.741482 | GPT-4.5 (openai-gpt-4.5-preview) | Imported | 2026-05-28 |
| 31 | Kimi K2 Instruct | 0.741131 | MoonshotAI: Kimi K2 0711 (moonshotai-kimi-k2) | Imported | 2026-05-28 |
| 32 | Gemini 2.5 Pro (03-25 preview) | 0.735862 | Gemini 2.5 Pro (google-gemini-2.5-pro) | Imported | 2026-05-28 |
| 33 | Gemini 3 Pro (Preview) | 0.732086 | Gemini 3 (google-gemini-3) | Imported | 2026-05-28 |
| 34 | GPT-4 Turbo (2024-04-09) | 0.718739 | GPT-4 Turbo (openai-gpt-4-turbo) | Imported | 2026-05-28 |
| 35 | IBM Granite 4.0 Small | 0.715841 | | Imported | 2026-05-28 |
| 36 | Llama 3 Instruct (8B) | 0.709168 | | Imported | 2026-05-28 |
| 37 | Gemini 2.5 Flash (04-17 preview) | 0.686688 | Gemini 2.5 Flash (google-gemini-2.5-flash) | Imported | 2026-05-28 |
| 38 | Llama 4 Maverick (17Bx128E) Instruct FP8 | 0.685985 | | Imported | 2026-05-28 |
| 39 | Gemini 2.0 Pro (02-05 preview) | 0.683966 | | Imported | 2026-05-28 |
| 40 | Gemini 2.0 Flash Lite (02-05 preview) | 0.674570 | Gemini 2.0 Flash Lite (google-gemini-2.0-flash-lite-001) | Imported | 2026-05-28 |
| 41 | Gemini 1.5 Pro (002) | 0.673340 | | Imported | 2026-05-28 |
| 42 | Gemini 1.5 Flash (002) | 0.671057 | | Imported | 2026-05-28 |
| 43 | Palmyra Fin | 0.662540 | | Imported | 2026-05-28 |
| 44 | Gemini 2.0 Flash | 0.662188 | Gemini 2.0 Flash (google-gemini-2.0-flash) | Imported | 2026-05-28 |
| 45 | IBM Granite 4.0 Micro | 0.660520 | | Imported | 2026-05-28 |
| 46 | Gemini 2.5 Flash-Lite | 0.657710 | Gemini 2.5 Flash Lite (google-gemini-2.5-flash-lite) | Imported | 2026-05-28 |
| 47 | GPT-4.1 (2025-04-14) | 0.647875 | GPT-4.1 (openai-gpt-4.1) | Imported | 2026-05-28 |
| 48 | Llama 3 Instruct (70B) | 0.646207 | | Imported | 2026-05-28 |
| 49 | GPT-4 (0613) | 0.641728 | GPT-4 (openai-gpt-4) | Imported | 2026-05-28 |
| 50 | GPT-3.5 Turbo (0301) | 0.635494 | GPT-3.5 Turbo (openai-gpt-3.5-turbo) | Imported | 2026-05-28 |
| 51 | GPT-3.5 Turbo (0613) | 0.631279 | GPT-3.5 Turbo (openai-gpt-3.5-turbo) | Imported | 2026-05-28 |
| 52 | GPT-4o (2024-08-06) | 0.623463 | GPT-4o (openai-gpt-4o) | Imported | 2026-05-28 |
| 53 | Llama 3.1 Instruct Turbo (8B) | 0.623375 | | Imported | 2026-05-28 |
| 54 | Qwen2 Instruct (72B) | 0.621005 | | Imported | 2026-05-28 |
| 55 | GPT-4.1 nano (2025-04-14) | 0.615297 | GPT-4.1 Nano (openai-gpt-4.1-nano) | Imported | 2026-05-28 |
| 56 | GPT-4.1 mini (2025-04-14) | 0.604408 | GPT-4.1 Mini (openai-gpt-4.1-mini) | Imported | 2026-05-28 |
| 57 | Qwen2.5 Instruct Turbo (72B) | 0.589744 | | Imported | 2026-05-28 |
| 58 | Llama 3.1 Instruct Turbo (405B) | 0.586319 | | Imported | 2026-05-28 |
| 59 | Gemini 1.0 Pro (002) | 0.581577 | | Imported | 2026-05-28 |
| 60 | Palmyra Med | 0.577977 | | Imported | 2026-05-28 |
| 61 | GLM-4.5-Air-FP8 | 0.570864 | GLM 4.5 Air (z-ai-glm-4.5-air) | Imported | 2026-05-28 |
| 62 | GPT-4o mini (2024-07-18) | 0.562610 | GPT-4o-mini (openai-gpt-4o-mini) | Imported | 2026-05-28 |
| 63 | Qwen3 235B A22B FP8 Throughput | 0.560327 | | Imported | 2026-05-28 |
| 64 | Yi Chat (34B) | 0.536178 | | Imported | 2026-05-28 |
| 65 | Grok 3 mini Beta | 0.535037 | Grok 3 Mini Beta (x-ai-grok-3-mini-beta) | Imported | 2026-05-28 |
| 66 | DeepSeek R1 | 0.529066 | R1 (deepseek-r1) | Imported | 2026-05-28 |
| 67 | GPT-4o (2024-05-13) | 0.527924 | GPT-4o (openai-gpt-4o) | Imported | 2026-05-28 |
| 68 | GPT-3.5 Turbo (1106) | 0.525378 | GPT-3.5 Turbo (openai-gpt-3.5-turbo) | Imported | 2026-05-28 |
| 69 | Llama 4 Scout (17Bx16E) Instruct | 0.522655 | | Imported | 2026-05-28 |
| 70 | Grok 3 Beta | 0.513435 | Grok 3 Beta (x-ai-grok-3-beta) | Imported | 2026-05-28 |
| 71 | DeepSeek LLM Chat (67B) | 0.505444 | | Imported | 2026-05-28 |
| 72 | Qwen1.5 Chat (72B) | 0.485950 | | Imported | 2026-05-28 |
| 73 | Qwen2.5 Instruct Turbo (7B) | 0.470320 | | Imported | 2026-05-28 |
| 74 | o1-mini (2024-09-12) | 0.452494 | | Imported | 2026-05-28 |
| 75 | Grok 4 (0709) | 0.443800 | Grok 4 (x-ai-grok-4) | Imported | 2026-05-28 |
| 76 | Palmyra-X-004 | 0.442396 | | Imported | 2026-05-28 |
| 77 | Mixtral Instruct (8x22B) | 0.440376 | | Imported | 2026-05-28 |
| 78 | GPT-3.5 Turbo (0125) | 0.439673 | GPT-3.5 Turbo (openai-gpt-3.5-turbo) | Imported | 2026-05-28 |
| 79 | Llama 3.1 Instruct Turbo (70B) | 0.425009 | | Imported | 2026-05-28 |
| 80 | DeepSeek v3 | 0.407885 | DeepSeek V3 (deepseek-deepseek-chat) | Imported | 2026-05-28 |
| 81 | Mixtral Instruct (8x7B) | 0.391465 | | Imported | 2026-05-28 |
| 82 | Mistral Large 2 (2407) | 0.352564 | | Imported | 2026-05-28 |
| 83 | Mistral Small 3 (2501) | 0.327538 | | Imported | 2026-05-28 |
| 84 | Mistral Instruct v0.3 (7B) | 0.325518 | | Imported | 2026-05-28 |
| 85 | Command R | 0.317966 | Command R (08-2024) (cohere-command-r-08-2024) | Imported | 2026-05-28 |
| 86 | Command R Plus | 0.292677 | | Imported | 2026-05-28 |
| 87 | DBRX Instruct | 0.253512 | | Imported | 2026-05-28 |