SafeArena
Safety benchmark for autonomous web agents over safe and harmful web tasks, including normalized safety, harmful completion, safe completion, and refusal rates.
5 rows
Primary metric: normalized_safety_score
Sampled: 2026-05-27
Metadata
Metrics
Normalized Safety Score, Safe Completion Rate, Harmful Completion Rate (lower is better), Refusal Rate (lower is better)
| Rank | Subject | Normalized Safety Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude-3.5-Sonnet-202406 | 55.0 | — | Imported | 2026-05-27 |
| 2 | GPT-4o-Mini | 35.7 | — | Imported | 2026-05-27 |
| 3 | llama-3.2-90b-Vision-Instruct | 34.0 | — | Imported | 2026-05-27 |
| 4 | GPT-4o | 31.7 | — | Imported | 2026-05-27 |
| 5 | Qwen-2-VL-72B | 21.5 | — | Imported | 2026-05-27 |
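The ranking above can be reproduced from the listed scores. The following is a minimal sketch, not the leaderboard's own code: it assumes that metrics without a "lower is better" note are ranked higher-is-better. The `rank` helper, the `LOWER_IS_BETTER` mapping, and every metric key other than `normalized_safety_score` are hypothetical names introduced for illustration.

```python
# Rows copied from the table above: (subject, normalized safety score).
ROWS = [
    ("Claude-3.5-Sonnet-202406", 55.0),
    ("GPT-4o-Mini", 35.7),
    ("llama-3.2-90b-Vision-Instruct", 34.0),
    ("GPT-4o", 31.7),
    ("Qwen-2-VL-72B", 21.5),
]

# Assumption: direction flags mirror the "lower is better" notes in the
# metrics list; keys other than normalized_safety_score are hypothetical.
LOWER_IS_BETTER = {
    "normalized_safety_score": False,
    "safe_completion_rate": False,
    "harmful_completion_rate": True,
    "refusal_rate": True,
}


def rank(rows, metric="normalized_safety_score"):
    """Return (rank, subject, score) tuples, best first, for the given metric."""
    # Sort descending for higher-is-better metrics, ascending otherwise.
    ordered = sorted(rows, key=lambda r: r[1], reverse=not LOWER_IS_BETTER[metric])
    return [(i, name, score) for i, (name, score) in enumerate(ordered, start=1)]


if __name__ == "__main__":
    for position, subject, score in rank(ROWS):
        print(position, subject, score)
```

Only the normalized safety score is listed per subject here, so the other metric keys serve solely to show how a lower-is-better column would flip the sort order.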