HiL-Bench

Scale AI's Human-in-Loop benchmark for measuring agents' selective escalation and help-seeking judgment.

Rows: 10
Primary metric: pass_at_3
Sampled: 2026-05-05

Metadata

Metrics

Pass@3, ASK-F1
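
This page names the two metrics but does not define them. As a rough illustration only, the sketch below shows how a pass@k score and an F1-style score are commonly computed. It assumes that Pass@3 follows the standard unbiased pass@k estimator and that ASK-F1 is a harmonic mean of a precision and a recall over the agent's questions; neither assumption is confirmed here, and the function names pass_at_k and f1 are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: attempts sampled, c: attempts that passed, k: attempt budget.
    ASSUMPTION: HiL-Bench may compute Pass@3 differently.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset of attempts contains at least one pass.
        return 1.0
    # 1 minus the probability that k attempts drawn without replacement all fail.
    return 1.0 - comb(n - c, k) / comb(n, k)


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall.

    ASSUMPTION: ASK-F1 combines a question-precision and a recall term this way.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With n = k = 3, pass_at_k reduces to "at least one of three attempts passed", so a task scores either 0 or 1, and the benchmark-level figure would be the mean over tasks.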

Latest Results

Pass@3 values come from the embedded Scale Labs leaderboard entries. ASK-F1 is included where the public results chart exposes exact combined values; newer entries without charted ASK-F1 are left blank.

Rank  Subject          Pass@3  Model Match                                             Provenance  Sampled
1     GPT-5.5          29.1%   GPT-5.5 (openai-gpt-5.5)                                Imported    2026-05-05
2     Claude Opus 4.7  27.67%  Claude Opus 4.7 (anthropic-claude-opus-4.7)             Imported    2026-05-05
3     Claude Opus 4.6  24.33%  Claude Opus 4.6 (anthropic-claude-opus-4.6)             Imported    2026-05-05
4     GLM-5.1          21%     GLM 5.1 (z-ai-glm-5.1)                                  Imported    2026-05-05
5     Gemini 3.1 Pro   20.33%  Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview)  Imported    2026-05-05
6     kimi-k2.6        14.67%  MoonshotAI: Kimi K2.6 (moonshotai-kimi-k2.6)            Imported    2026-05-05
7     GPT-5.4          9.33%   GPT-5.4 (openai-gpt-5.4)                                Imported    2026-05-05
8     Grok-4.20        8%      Grok 4.20 (x-ai-grok-4.20)                              Imported    2026-05-05
9     Minimax-M2.5     7.33%   MiniMax M2.5 (minimax-minimax-m2.5)                     Imported    2026-05-05
10    GPT-5.3-codex    3.67%   GPT-5.3-Codex (openai-gpt-5.3-codex)                    Imported    2026-05-05
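
For readers who want to work with these results programmatically, the following is a minimal sketch that holds the ten rows as typed records and checks that the listed ranks agree with descending Pass@3. The Entry class, its field names, and the helper functions are hypothetical and not part of any published HiL-Bench tooling; the values are copied from the rows above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    rank: int
    subject: str
    pass_at_3: float  # percentage points, as listed on this page
    model_id: str     # matched model identifier from the Model Match column


# Rows transcribed from the leaderboard (sampled 2026-05-05).
ROWS = [
    (1, "GPT-5.5", 29.1, "openai-gpt-5.5"),
    (2, "Claude Opus 4.7", 27.67, "anthropic-claude-opus-4.7"),
    (3, "Claude Opus 4.6", 24.33, "anthropic-claude-opus-4.6"),
    (4, "GLM-5.1", 21.0, "z-ai-glm-5.1"),
    (5, "Gemini 3.1 Pro", 20.33, "google-gemini-3.1-pro-preview"),
    (6, "kimi-k2.6", 14.67, "moonshotai-kimi-k2.6"),
    (7, "GPT-5.4", 9.33, "openai-gpt-5.4"),
    (8, "Grok-4.20", 8.0, "x-ai-grok-4.20"),
    (9, "Minimax-M2.5", 7.33, "minimax-minimax-m2.5"),
    (10, "GPT-5.3-codex", 3.67, "openai-gpt-5.3-codex"),
]
ENTRIES = [Entry(*row) for row in ROWS]


def ranks_match_scores(entries: list[Entry]) -> bool:
    """Return True if sorting by Pass@3 (descending) reproduces ranks 1..N."""
    ordered = sorted(entries, key=lambda e: e.pass_at_3, reverse=True)
    return [e.rank for e in ordered] == list(range(1, len(entries) + 1))


def gap_to_leader(entries: list[Entry]) -> dict[str, float]:
    """Percentage-point gap between each subject and the top Pass@3 score."""
    best = max(e.pass_at_3 for e in entries)
    return {e.subject: round(best - e.pass_at_3, 2) for e in entries}
```

A call such as ranks_match_scores(ENTRIES) is a cheap sanity check after re-importing the leaderboard, and gap_to_leader(ENTRIES) gives the spread relative to the top entry.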