SafeArena

Safety benchmark for autonomous web agents, covering both safe and harmful web tasks. It reports a normalized safety score along with harmful completion, safe completion, and refusal rates.

Rows: 5
Primary metric: normalized_safety_score
Sampled: 2026-05-27

Metadata

Metrics

Normalized Safety Score, Safe Completion Rate, Harmful Completion Rate (lower is better), Refusal Rate (lower is better)

Latest Results

Rows parsed from the SafeArena public Hugging Face Space table. The benchmark evaluates autonomous web-agent safety on safe and harmful tasks.

| Rank | Subject | Normalized Safety Score | Model Match | Provenance | Sampled |
| 1 | Claude-3.5-Sonnet-202406 | 55.0 | | Imported | 2026-05-27 |
| 2 | GPT-4o-Mini | 35.7 | | Imported | 2026-05-27 |
| 3 | llama-3.2-90b-Vision-Instruct | 34.0 | | Imported | 2026-05-27 |
| 4 | GPT-4o | 31.7 | | Imported | 2026-05-27 |
| 5 | Qwen-2-VL-72B | 21.5 | | Imported | 2026-05-27 |
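
The ranking above can be reproduced from the rows themselves. The following is a minimal sketch, not the site's actual import code: the `Row` class and `rank_rows` function are hypothetical names introduced for illustration, and it assumes that a higher Normalized Safety Score ranks first, which is inferred only from the ordering of the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """One leaderboard row (hypothetical structure, fields taken from the table)."""
    subject: str
    normalized_safety_score: float
    provenance: str
    sampled: str


# Rows copied from the table, deliberately listed out of rank order.
ROWS = [
    Row("GPT-4o", 31.7, "Imported", "2026-05-27"),
    Row("Qwen-2-VL-72B", 21.5, "Imported", "2026-05-27"),
    Row("Claude-3.5-Sonnet-202406", 55.0, "Imported", "2026-05-27"),
    Row("llama-3.2-90b-Vision-Instruct", 34.0, "Imported", "2026-05-27"),
    Row("GPT-4o-Mini", 35.7, "Imported", "2026-05-27"),
]


def rank_rows(rows):
    """Return (rank, row) pairs sorted by the primary metric, highest first.

    Assumption: higher normalized_safety_score is better.
    """
    ordered = sorted(rows, key=lambda r: r.normalized_safety_score, reverse=True)
    return list(enumerate(ordered, start=1))


ranked = rank_rows(ROWS)
for rank, row in ranked:
    print(f"{rank} {row.subject} {row.normalized_safety_score:.1f}")
```

Sorting on the primary metric alone is enough here because no two rows share a score; a tie-breaking rule would be a further assumption.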