HELM Safety
HELM Safety: Measures model robustness, truthfulness, calibration, bias, harmfulness, jailbreak resistance, or alignment-relevant behavior.
62 rows
Primary metric: mean_score
Sampled: 2026-05-28
Metadata
Metrics: Mean score, HarmBench - LM Evaluated Safety score, SimpleSafetyTests - LM Evaluated Safety score, BBQ - BBQ accuracy, Anthropic Red Team - LM Evaluated Safety score, XSTest - LM Evaluated Safety score
| Rank | Subject | Mean score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | o3 (2025-04-16) | 0.981606 | o3 openai-o3 | Imported | 2026-05-28 |
| 2 | Claude 4 Sonnet (20250514, extended thinking) | 0.980722 | — | Imported | 2026-05-28 |
| 3 | Claude 3.5 Sonnet (20240620) | 0.976697 | Claude 3.5 Sonnet anthropic-claude-3.5-sonnet | Imported | 2026-05-28 |
| 4 | o1 (2024-12-17) | 0.975800 | o1 openai-o1 | Imported | 2026-05-28 |
| 5 | Claude 4 Sonnet (20250514) | 0.974369 | — | Imported | 2026-05-28 |
| 6 | o4-mini (2025-04-16) | 0.973247 | o4 Mini openai-o4-mini | Imported | 2026-05-28 |
| 7 | Claude 4 Opus (20250514, extended thinking) | 0.969375 | — | Imported | 2026-05-28 |
| 8 | Claude 4 Opus (20250514) | 0.967675 | — | Imported | 2026-05-28 |
| 9 | Claude 3 Opus (20240229) | 0.967450 | — | Imported | 2026-05-28 |
| 10 | GPT-4.5 (2025-02-27 preview) | 0.964672 | GPT-4.5 openai-gpt-4.5-preview | Imported | 2026-05-28 |
| 11 | GPT-4.1 (2025-04-14) | 0.962853 | GPT-4.1 openai-gpt-4.1 | Imported | 2026-05-28 |
| 12 | o3-mini (2025-01-31) | 0.961961 | o3-mini openai-o3-mini | Imported | 2026-05-28 |
| 13 | GPT-4 Turbo (2024-04-09) | 0.960619 | GPT-4 Turbo openai-gpt-4-turbo | Imported | 2026-05-28 |
| 14 | o1-mini (2024-09-12) | 0.955323 | — | Imported | 2026-05-28 |
| 15 | GPT-4.1 mini (2025-04-14) | 0.948914 | GPT-4.1 Mini openai-gpt-4.1-mini | Imported | 2026-05-28 |
| 16 | Palmyra Fin | 0.947933 | — | Imported | 2026-05-28 |
| 17 | GPT-4o (2024-05-13) | 0.945905 | GPT-4o openai-gpt-4o | Imported | 2026-05-28 |
| 18 | Claude 3.7 Sonnet (20250219) | 0.944914 | Claude 3.7 Sonnet anthropic-claude-3.7-sonnet | Imported | 2026-05-28 |
| 19 | Claude 3 Sonnet (20240229) | 0.942842 | — | Imported | 2026-05-28 |
| 20 | GPT-4.1 nano (2025-04-14) | 0.937650 | GPT-4.1 Nano openai-gpt-4.1-nano | Imported | 2026-05-28 |
| 21 | Qwen2 Instruct (72B) | 0.932678 | — | Imported | 2026-05-28 |
| 22 | Palmyra-X-004 | 0.932553 | — | Imported | 2026-05-28 |
| 23 | Qwen2.5 Instruct Turbo (72B) | 0.931439 | — | Imported | 2026-05-28 |
| 24 | GPT-4o mini (2024-07-18) | 0.930425 | GPT-4o-mini openai-gpt-4o-mini | Imported | 2026-05-28 |
| 25 | Gemini 1.5 Flash (001) | 0.927331 | — | Imported | 2026-05-28 |
| 26 | Palmyra X5 | 0.926392 | Palmyra X5 writer-palmyra-x5 | Imported | 2026-05-28 |
| 27 | Gemini 1.5 Pro (001) | 0.924464 | — | Imported | 2026-05-28 |
| 28 | Gemini 2.5 Pro (03-25 preview) | 0.913978 | Gemini 2.5 Pro google-gemini-2.5-pro | Imported | 2026-05-28 |
| 29 | Gemini 2.5 Flash (04-17 preview) | 0.911812 | Gemini 2.5 Flash google-gemini-2.5-flash | Imported | 2026-05-28 |
| 30 | Gemini 2.0 Flash | 0.909540 | Gemini 2.0 Flash google-gemini-2.0-flash | Imported | 2026-05-28 |
| 31 | Gemini 2.0 Flash Lite (02-05 preview) | 0.907886 | Gemini 2.0 Flash Lite google-gemini-2.0-flash-lite-001 | Imported | 2026-05-28 |
| 32 | Llama 4 Maverick (17Bx128E) Instruct FP8 | 0.905925 | — | Imported | 2026-05-28 |
| 33 | Gemini 2.0 Pro (02-05 preview) | 0.905053 | — | Imported | 2026-05-28 |
| 34 | Grok 3 mini Beta | 0.902239 | Grok 3 Mini Beta x-ai-grok-3-mini-beta | Imported | 2026-05-28 |
| 35 | Qwen3 235B A22B FP8 Throughput | 0.899630 | — | Imported | 2026-05-28 |
| 36 | Qwen2.5 Instruct Turbo (7B) | 0.898686 | — | Imported | 2026-05-28 |
| 37 | Llama 3.1 Instruct Turbo (405B) | 0.896578 | — | Imported | 2026-05-28 |
| 38 | Llama 3 Instruct (70B) | 0.894906 | — | Imported | 2026-05-28 |
| 39 | DeepSeek-R1-0528 | 0.894417 | R1 0528 deepseek-deepseek-r1-0528 | Imported | 2026-05-28 |
| 40 | Qwen1.5 Chat (72B) | 0.886219 | — | Imported | 2026-05-28 |
| 41 | Llama 3 Instruct (8B) | 0.885747 | — | Imported | 2026-05-28 |
| 42 | Claude 3 Haiku (20240307) | 0.877981 | Claude 3 Haiku anthropic-claude-3-haiku | Imported | 2026-05-28 |
| 43 | Llama 4 Scout (17Bx16E) Instruct | 0.873136 | — | Imported | 2026-05-28 |
| 44 | DeepSeek LLM Chat (67B) | 0.872539 | — | Imported | 2026-05-28 |
| 45 | DeepSeek v3 | 0.871772 | DeepSeek V3 deepseek-deepseek-chat | Imported | 2026-05-28 |
| 46 | DeepSeek R1 (hide reasoning) | 0.868314 | R1 deepseek-r1 | Imported | 2026-05-28 |
| 47 | DeepSeek R1 | 0.865442 | R1 deepseek-r1 | Imported | 2026-05-28 |
| 48 | Llama 3.1 Instruct Turbo (8B) | 0.862092 | — | Imported | 2026-05-28 |
| 49 | Command R Plus | 0.860517 | — | Imported | 2026-05-28 |
| 50 | Palmyra Med | 0.856767 | — | Imported | 2026-05-28 |
| 51 | Grok 3 Beta | 0.855264 | Grok 3 Beta x-ai-grok-3-beta | Imported | 2026-05-28 |
| 52 | GPT-3.5 Turbo (0613) | 0.852869 | GPT-3.5 Turbo openai-gpt-3.5-turbo | Imported | 2026-05-28 |
| 53 | Mixtral Instruct (8x22B) | 0.850761 | — | Imported | 2026-05-28 |
| 54 | Mistral Small 3 (2501) | 0.846781 | — | Imported | 2026-05-28 |
| 55 | Llama 3.1 Instruct Turbo (70B) | 0.844975 | — | Imported | 2026-05-28 |
| 56 | GPT-3.5 Turbo (1106) | 0.834979 | GPT-3.5 Turbo openai-gpt-3.5-turbo | Imported | 2026-05-28 |
| 57 | Mixtral Instruct (8x7B) | 0.814336 | — | Imported | 2026-05-28 |
| 58 | GPT-3.5 Turbo (0125) | 0.813594 | GPT-3.5 Turbo openai-gpt-3.5-turbo | Imported | 2026-05-28 |
| 59 | Command R | 0.809403 | Command R (08-2024) cohere-command-r-08-2024 | Imported | 2026-05-28 |
| 60 | Mistral Instruct v0.3 (7B) | 0.729686 | — | Imported | 2026-05-28 |
| 61 | DBRX Instruct | 0.627667 | — | Imported | 2026-05-28 |
| 62 | Mistral Instruct v0.1 (7B) | 0.525541 | — | Imported | 2026-05-28 |
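
The page lists five per-benchmark metrics next to the mean score but does not say how they are combined. The sketch below assumes the mean score is the unweighted average of the five per-benchmark scores, which is an assumption and not something this page confirms. The per-benchmark values in `example_scores` are hypothetical and only illustrate the arithmetic; the three entries in `table_rows` are copied from the table above. The helper names `mean_score` and `rank_models` are illustrative.

```python
from statistics import mean

# Hypothetical per-benchmark scores for a single model. These numbers are
# illustrative only and do not come from the leaderboard.
example_scores = {
    "HarmBench": 0.90,
    "SimpleSafetyTests": 1.00,
    "BBQ": 0.95,
    "Anthropic Red Team": 0.99,
    "XSTest": 0.96,
}


def mean_score(per_benchmark: dict[str, float]) -> float:
    """Unweighted average of per-benchmark scores (assumed aggregation)."""
    return mean(per_benchmark.values())


def rank_models(scores: dict[str, float]) -> list[tuple[int, str, float]]:
    """Sort models by score, highest first, and attach 1-based ranks."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(i, name, score) for i, (name, score) in enumerate(ordered, start=1)]


# Three mean scores taken from the table above, deliberately out of order.
table_rows = {
    "o1 (2024-12-17)": 0.975800,
    "o3 (2025-04-16)": 0.981606,
    "Claude 3.5 Sonnet (20240620)": 0.976697,
}

if __name__ == "__main__":
    print(f"example mean score: {mean_score(example_scores):.6f}")
    for position, name, score in rank_models(table_rows):
        print(position, name, f"{score:.6f}")
```

Sorting in descending order matches the table, where a higher mean score earns a lower rank number.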