HiL-Bench
Scale AI's Human-in-Loop benchmark for measuring agents' selective escalation and help-seeking judgment.
10 rows
Primary metric: pass_at_3
Sampled: 2026-05-05
Metadata
Metrics: Pass@3, ASK-F1
| Rank | Subject | Pass@3 | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT-5.5 | 29.1% | GPT-5.5 (`openai-gpt-5.5`) | Imported | 2026-05-05 |
| 2 | Claude Opus 4.7 | 27.67% | Claude Opus 4.7 (`anthropic-claude-opus-4.7`) | Imported | 2026-05-05 |
| 3 | Claude Opus 4.6 | 24.33% | Claude Opus 4.6 (`anthropic-claude-opus-4.6`) | Imported | 2026-05-05 |
| 4 | GLM-5.1 | 21% | GLM 5.1 (`z-ai-glm-5.1`) | Imported | 2026-05-05 |
| 5 | Gemini 3.1 Pro | 20.33% | Gemini 3.1 Pro Preview (`google-gemini-3.1-pro-preview`) | Imported | 2026-05-05 |
| 6 | kimi-k2.6 | 14.67% | MoonshotAI: Kimi K2.6 (`moonshotai-kimi-k2.6`) | Imported | 2026-05-05 |
| 7 | GPT-5.4 | 9.33% | GPT-5.4 (`openai-gpt-5.4`) | Imported | 2026-05-05 |
| 8 | Grok-4.20 | 8% | Grok 4.20 (`x-ai-grok-4.20`) | Imported | 2026-05-05 |
| 9 | Minimax-M2.5 | 7.33% | MiniMax M2.5 (`minimax-minimax-m2.5`) | Imported | 2026-05-05 |
| 10 | GPT-5.3-codex | 3.67% | GPT-5.3-Codex (`openai-gpt-5.3-codex`) | Imported | 2026-05-05 |
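
For readers who want to work with these numbers directly, the following is a minimal Python sketch that re-derives the rank order from the Pass@3 column and computes each subject's percentage-point gap to the leader. The scores are copied from the table above; the helper names `ranked` and `gap_to_leader` are hypothetical and not part of any HiL-Bench tooling.

```python
# Pass@3 scores (percent) copied from the leaderboard table above.
PASS_AT_3 = {
    "GPT-5.5": 29.1,
    "Claude Opus 4.7": 27.67,
    "Claude Opus 4.6": 24.33,
    "GLM-5.1": 21.0,
    "Gemini 3.1 Pro": 20.33,
    "kimi-k2.6": 14.67,
    "GPT-5.4": 9.33,
    "Grok-4.20": 8.0,
    "Minimax-M2.5": 7.33,
    "GPT-5.3-codex": 3.67,
}


def ranked(scores):
    """Return (rank, subject, score) tuples, highest score first."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(i + 1, name, score) for i, (name, score) in enumerate(ordered)]


def gap_to_leader(scores):
    """Return each subject's gap to the top score, in percentage points."""
    best = max(scores.values())
    return {name: round(best - score, 2) for name, score in scores.items()}


if __name__ == "__main__":
    gaps = gap_to_leader(PASS_AT_3)
    for rank, name, score in ranked(PASS_AT_3):
        print(f"{rank:>2}  {name:<16} {score:>6.2f}%  (-{gaps[name]:.2f} pts)")
```

Sorting by score reproduces the Rank column as published, which is a quick consistency check after copying or re-importing the rows.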