HELM AIR-Bench
HELM AIR-Bench: Measures model robustness, truthfulness, calibration, bias, harmfulness, jailbreak resistance, or alignment-relevant behavior.
87 rows
Primary metric: refusal_rate
Sampled: 2026-05-28
Metadata
Metrics
Refusal Rate, Security Risks, Operational Misuses, Violence & Extremism, Hate/Toxicity, Sexual Content, Child Harm, Self-harm, Political Usage, Economic Harm, Deception, Manipulation, Defamation, Fundamental Rights, Discrimination/Bias, Privacy, Criminal Activities, Observed inference time (s) (lower is better), # eval
| Rank | Subject | Refusal Rate | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude 4.5 Haiku (20251001) | 0.931507 | — | Imported | 2026-05-28 |
| 2 | Claude 3.5 Sonnet (20241022) | 0.908325 | Claude 3.5 Sonnet anthropic-claude-3.5-sonnet | Imported | 2026-05-28 |
| 3 | Claude 4.5 Sonnet (20250929) | 0.898402 | — | Imported | 2026-05-28 |
| 4 | Claude 4 Sonnet (20250514) | 0.882684 | — | Imported | 2026-05-28 |
| 5 | gpt-oss-120b | 0.880049 | gpt-oss-120b openai-gpt-oss-120b | Imported | 2026-05-28 |
| 6 | GPT-5 nano (2025-08-07) | 0.878205 | GPT-5 Nano openai-gpt-5-nano | Imported | 2026-05-28 |
| 7 | GPT-5 (2025-08-07) | 0.876712 | GPT-5 openai-gpt-5 | Imported | 2026-05-28 |
| 8 | Qwen3-Next 80B A3B Thinking | 0.866965 | Qwen3 Next 80B A3B Thinking qwen-qwen3-next-80b-a3b-thinking | Imported | 2026-05-28 |
| 9 | GPT-5.1 (2025-11-13) | 0.861872 | GPT-5.1 openai-gpt-5.1 | Imported | 2026-05-28 |
| 10 | gpt-oss-20b | 0.859677 | gpt-oss-20b openai-gpt-oss-20b | Imported | 2026-05-28 |
| 11 | Claude 3.5 Sonnet (20240620) | 0.858974 | Claude 3.5 Sonnet anthropic-claude-3.5-sonnet | Imported | 2026-05-28 |
| 12 | Claude 4 Opus (20250514) | 0.857394 | — | Imported | 2026-05-28 |
| 13 | GPT-5 mini (2025-08-07) | 0.857130 | GPT-5 Mini openai-gpt-5-mini | Imported | 2026-05-28 |
| 14 | Claude 3 Sonnet (20240229) | 0.846944 | — | Imported | 2026-05-28 |
| 15 | o3 (2025-04-16) | 0.844661 | o3 openai-o3 | Imported | 2026-05-28 |
| 16 | Claude 3 Opus (20240229) | 0.843695 | — | Imported | 2026-05-28 |
| 17 | Gemini 1.5 Pro (001, BLOCK_NONE safety) | 0.828328 | — | Imported | 2026-05-28 |
| 18 | Claude 3 Haiku (20240307) | 0.827011 | Claude 3 Haiku anthropic-claude-3-haiku | Imported | 2026-05-28 |
| 19 | IBM Granite 3.3 8B Instruct (with guardian) | 0.825167 | — | Imported | 2026-05-28 |
| 20 | IBM Granite 4.0 Small (with guardian) | 0.820513 | — | Imported | 2026-05-28 |
| 21 | Claude 3.7 Sonnet (20250219) | 0.817703 | Claude 3.7 Sonnet anthropic-claude-3.7-sonnet | Imported | 2026-05-28 |
| 22 | IBM Granite 4.0 Micro (with guardian) | 0.803916 | — | Imported | 2026-05-28 |
| 23 | o1 (2024-12-17) | 0.799614 | o1 openai-o1 | Imported | 2026-05-28 |
| 24 | Gemini 1.5 Flash (001, BLOCK_NONE safety) | 0.794169 | — | Imported | 2026-05-28 |
| 25 | Qwen3 235B A22B Instruct 2507 FP8 | 0.789691 | — | Imported | 2026-05-28 |
| 26 | o4-mini (2025-04-16) | 0.784861 | o4 Mini openai-o4-mini | Imported | 2026-05-28 |
| 27 | Palmyra X5 | 0.781700 | Palmyra X5 writer-palmyra-x5 | Imported | 2026-05-28 |
| 28 | IBM Granite 3.3 8B Instruct | 0.760450 | — | Imported | 2026-05-28 |
| 29 | o3-mini (2025-01-31) | 0.748858 | o3-mini openai-o3-mini | Imported | 2026-05-28 |
| 30 | GPT-4.5 (2025-02-27 preview) | 0.741482 | GPT-4.5 openai-gpt-4.5-preview | Imported | 2026-05-28 |
| 31 | Kimi K2 Instruct | 0.741131 | MoonshotAI: Kimi K2 0711 moonshotai-kimi-k2 | Imported | 2026-05-28 |
| 32 | Gemini 2.5 Pro (03-25 preview) | 0.735862 | Gemini 2.5 Pro google-gemini-2.5-pro | Imported | 2026-05-28 |
| 33 | Gemini 3 Pro (Preview) | 0.732086 | Gemini 3 google-gemini-3 | Imported | 2026-05-28 |
| 34 | GPT-4 Turbo (2024-04-09) | 0.718739 | GPT-4 Turbo openai-gpt-4-turbo | Imported | 2026-05-28 |
| 35 | IBM Granite 4.0 Small | 0.715841 | — | Imported | 2026-05-28 |
| 36 | Llama 3 Instruct (8B) | 0.709168 | — | Imported | 2026-05-28 |
| 37 | Gemini 2.5 Flash (04-17 preview) | 0.686688 | Gemini 2.5 Flash google-gemini-2.5-flash | Imported | 2026-05-28 |
| 38 | Llama 4 Maverick (17Bx128E) Instruct FP8 | 0.685985 | — | Imported | 2026-05-28 |
| 39 | Gemini 2.0 Pro (02-05 preview) | 0.683966 | — | Imported | 2026-05-28 |
| 40 | Gemini 2.0 Flash Lite (02-05 preview) | 0.674570 | Gemini 2.0 Flash Lite google-gemini-2.0-flash-lite-001 | Imported | 2026-05-28 |
| 41 | Gemini 1.5 Pro (002) | 0.673340 | — | Imported | 2026-05-28 |
| 42 | Gemini 1.5 Flash (002) | 0.671057 | — | Imported | 2026-05-28 |
| 43 | Palmyra Fin | 0.662540 | — | Imported | 2026-05-28 |
| 44 | Gemini 2.0 Flash | 0.662188 | Gemini 2.0 Flash google-gemini-2.0-flash | Imported | 2026-05-28 |
| 45 | IBM Granite 4.0 Micro | 0.660520 | — | Imported | 2026-05-28 |
| 46 | Gemini 2.5 Flash-Lite | 0.657710 | Gemini 2.5 Flash Lite google-gemini-2.5-flash-lite | Imported | 2026-05-28 |
| 47 | GPT-4.1 (2025-04-14) | 0.647875 | GPT-4.1 openai-gpt-4.1 | Imported | 2026-05-28 |
| 48 | Llama 3 Instruct (70B) | 0.646207 | — | Imported | 2026-05-28 |
| 49 | GPT-4 (0613) | 0.641728 | GPT-4 openai-gpt-4 | Imported | 2026-05-28 |
| 50 | GPT-3.5 Turbo (0301) | 0.635494 | GPT-3.5 Turbo openai-gpt-3.5-turbo | Imported | 2026-05-28 |
| 51 | GPT-3.5 Turbo (0613) | 0.631279 | GPT-3.5 Turbo openai-gpt-3.5-turbo | Imported | 2026-05-28 |
| 52 | GPT-4o (2024-08-06) | 0.623463 | GPT-4o openai-gpt-4o | Imported | 2026-05-28 |
| 53 | Llama 3.1 Instruct Turbo (8B) | 0.623375 | — | Imported | 2026-05-28 |
| 54 | Qwen2 Instruct (72B) | 0.621005 | — | Imported | 2026-05-28 |
| 55 | GPT-4.1 nano (2025-04-14) | 0.615297 | GPT-4.1 Nano openai-gpt-4.1-nano | Imported | 2026-05-28 |
| 56 | GPT-4.1 mini (2025-04-14) | 0.604408 | GPT-4.1 Mini openai-gpt-4.1-mini | Imported | 2026-05-28 |
| 57 | Qwen2.5 Instruct Turbo (72B) | 0.589744 | — | Imported | 2026-05-28 |
| 58 | Llama 3.1 Instruct Turbo (405B) | 0.586319 | — | Imported | 2026-05-28 |
| 59 | Gemini 1.0 Pro (002) | 0.581577 | — | Imported | 2026-05-28 |
| 60 | Palmyra Med | 0.577977 | — | Imported | 2026-05-28 |
| 61 | GLM-4.5-Air-FP8 | 0.570864 | GLM 4.5 Air z-ai-glm-4.5-air | Imported | 2026-05-28 |
| 62 | GPT-4o mini (2024-07-18) | 0.562610 | GPT-4o-mini openai-gpt-4o-mini | Imported | 2026-05-28 |
| 63 | Qwen3 235B A22B FP8 Throughput | 0.560327 | — | Imported | 2026-05-28 |
| 64 | Yi Chat (34B) | 0.536178 | — | Imported | 2026-05-28 |
| 65 | Grok 3 mini Beta | 0.535037 | Grok 3 Mini Beta x-ai-grok-3-mini-beta | Imported | 2026-05-28 |
| 66 | DeepSeek R1 | 0.529066 | R1 deepseek-r1 | Imported | 2026-05-28 |
| 67 | GPT-4o (2024-05-13) | 0.527924 | GPT-4o openai-gpt-4o | Imported | 2026-05-28 |
| 68 | GPT-3.5 Turbo (1106) | 0.525378 | GPT-3.5 Turbo openai-gpt-3.5-turbo | Imported | 2026-05-28 |
| 69 | Llama 4 Scout (17Bx16E) Instruct | 0.522655 | — | Imported | 2026-05-28 |
| 70 | Grok 3 Beta | 0.513435 | Grok 3 Beta x-ai-grok-3-beta | Imported | 2026-05-28 |
| 71 | DeepSeek LLM Chat (67B) | 0.505444 | — | Imported | 2026-05-28 |
| 72 | Qwen1.5 Chat (72B) | 0.485950 | — | Imported | 2026-05-28 |
| 73 | Qwen2.5 Instruct Turbo (7B) | 0.470320 | — | Imported | 2026-05-28 |
| 74 | o1-mini (2024-09-12) | 0.452494 | — | Imported | 2026-05-28 |
| 75 | Grok 4 (0709) | 0.443800 | Grok 4 x-ai-grok-4 | Imported | 2026-05-28 |
| 76 | Palmyra-X-004 | 0.442396 | — | Imported | 2026-05-28 |
| 77 | Mixtral Instruct (8x22B) | 0.440376 | — | Imported | 2026-05-28 |
| 78 | GPT-3.5 Turbo (0125) | 0.439673 | GPT-3.5 Turbo openai-gpt-3.5-turbo | Imported | 2026-05-28 |
| 79 | Llama 3.1 Instruct Turbo (70B) | 0.425009 | — | Imported | 2026-05-28 |
| 80 | DeepSeek v3 | 0.407885 | DeepSeek V3 deepseek-deepseek-chat | Imported | 2026-05-28 |
| 81 | Mixtral Instruct (8x7B) | 0.391465 | — | Imported | 2026-05-28 |
| 82 | Mistral Large 2 (2407) | 0.352564 | — | Imported | 2026-05-28 |
| 83 | Mistral Small 3 (2501) | 0.327538 | — | Imported | 2026-05-28 |
| 84 | Mistral Instruct v0.3 (7B) | 0.325518 | — | Imported | 2026-05-28 |
| 85 | Command R | 0.317966 | Command R (08-2024) cohere-command-r-08-2024 | Imported | 2026-05-28 |
| 86 | Command R Plus | 0.292677 | — | Imported | 2026-05-28 |
| 87 | DBRX Instruct | 0.253512 | — | Imported | 2026-05-28 |
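For readers who want to work with these numbers programmatically, the following is a minimal sketch of how rows in this table format could be parsed and re-ranked by refusal rate. It assumes the table is available as pipe-delimited text with the same column headers as above; the helper name `parse_rows` and the four-row excerpt are illustrative only and are not part of HELM or of this leaderboard's tooling.

```python
# Excerpt of four rows copied from the table above (not the full leaderboard).
TABLE = """\
| Rank | Subject | Refusal Rate | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 2 | Claude 3.5 Sonnet (20241022) | 0.908325 | Claude 3.5 Sonnet anthropic-claude-3.5-sonnet | Imported | 2026-05-28 |
| 5 | gpt-oss-120b | 0.880049 | gpt-oss-120b openai-gpt-oss-120b | Imported | 2026-05-28 |
| 80 | DeepSeek v3 | 0.407885 | DeepSeek V3 deepseek-deepseek-chat | Imported | 2026-05-28 |
| 85 | Command R | 0.317966 | Command R (08-2024) cohere-command-r-08-2024 | Imported | 2026-05-28 |
"""


def parse_rows(table: str) -> list[dict[str, str]]:
    """Hypothetical helper: turn a pipe-delimited table into a list of dicts."""
    lines = [ln.strip() for ln in table.splitlines() if ln.strip()]
    header = [cell.strip() for cell in lines[0].strip("|").split("|")]
    rows = []
    for ln in lines[2:]:  # skip the header row and the |---| separator row
        cells = [cell.strip() for cell in ln.strip("|").split("|")]
        rows.append(dict(zip(header, cells)))
    return rows


rows = parse_rows(TABLE)

# Map each subject to its refusal rate as a float.
rates = {row["Subject"]: float(row["Refusal Rate"]) for row in rows}

# Rank subjects from highest to lowest refusal rate, as the leaderboard does.
ranked = sorted(rates, key=rates.get, reverse=True)

# Gap between the highest and lowest refusal rate in this excerpt.
spread = max(rates.values()) - min(rates.values())

for position, subject in enumerate(ranked, start=1):
    print(f"{position}. {subject}: {rates[subject]:.6f}")
print(f"spread in excerpt: {spread:.6f}")
```

The positions printed here are relative to the excerpt, not the full leaderboard ranks; feeding in the whole table would be expected to reproduce the Rank column, assuming it is ordered purely by Refusal Rate.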