BrowserART
BrowserART (Browser Agent Red teaming Toolkit) is a benchmark for evaluating whether browser agents pursue harmful web tasks despite refusal training in their underlying chat models.
Rows: 8
Primary metric: harmful_behavior_pursuit_rate
Sampled: 2026-05-28
Metadata
Metrics
Human Rewrite ASR (lower is better), Direct Ask ASR (lower is better), Prefix ASR (lower is better), GCG ASR (lower is better), Random Search ASR (lower is better), Ensemble ASR (lower is better)
| Rank | Subject | Human Rewrite ASR | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | OpenHands + Opus-3 | 40% | — | Imported | 2026-05-28 |
| 2 | OpenHands + o1-preview | 63% | — | Imported | 2026-05-28 |
| 3 | OpenHands + Gemini-1.5 | 65% | — | Imported | 2026-05-28 |
| 4 | OpenHands + Sonnet-3.5 | 70% | — | Imported | 2026-05-28 |
| 5 | OpenHands + Llama-3.1 | 73% | — | Imported | 2026-05-28 |
| 6 | OpenHands + o1-mini | 84% | — | Imported | 2026-05-28 |
| 7 | OpenHands + GPT-4o | 98% | — | Imported | 2026-05-28 |
| 8 | OpenHands + GPT-4-turbo | 99% | — | Imported | 2026-05-28 |