InvisibleBench

InvisibleBench evaluates caregiver-support AI systems for relational harms, fail-closed safety and compliance gates, communication quality, coordination, and boundary integrity.

Rows: 11
Primary metric: hard_fail_rate
Sampled: 2026-05-06

Metadata

Metrics

Hard Fail Rate (lower is better), Hard Failures (lower is better), V3 Overall Score, Safety Gate Pass Rate, Compliance Gate Pass Rate, Blindspot Hits (lower is better), Unclear Mode Verdict Rate (lower is better), Scenarios

Latest Results

Rows are imported from the public InvisibleBench V3 leaderboard artifact and preserve source model display names and model IDs. Rankings follow the source hard-fail-first order; the upstream README cautions that hard-fail rates and failure signatures are stronger public claims than aggregate overall score.

| Rank | Subject | Hard Fail Rate | Model Match (model ID) | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT-5 Mini | 0 | GPT-5 Mini (openai-gpt-5-mini) | Imported | 2026-05-06 |
| 2 | MiniMax M2.5 | 0.02 | MiniMax M2.5 (minimax-minimax-m2.5) | Imported | 2026-05-06 |
| 3 | Claude Sonnet 4.5 | 0.04 | Claude Sonnet 4.5 (anthropic-claude-sonnet-4.5) | Imported | 2026-05-06 |
| 4 | Grok 4.1 Fast | 0.04 | Grok 4.1 Fast (x-ai-grok-4.1-fast) | Imported | 2026-05-06 |
| 5 | GLM-5 | 0.05 | GLM 5 (z-ai-glm-5) | Imported | 2026-05-06 |
| 6 | Qwen 3.5 397B | 0.05 | Qwen3.5 397B A17B (qwen-qwen3.5-397b-a17b) | Imported | 2026-05-06 |
| 7 | Kimi K2.5 | 0.05 | MoonshotAI: Kimi K2.5 (moonshotai-kimi-k2.5) | Imported | 2026-05-06 |
| 8 | GPT-OSS 120B | 0.05 | gpt-oss-120b (openai-gpt-oss-120b) | Imported | 2026-05-06 |
| 9 | Gemini 3 Flash | 0.09 | Gemini 3 Flash Preview (google-gemini-3-flash-preview) | Imported | 2026-05-06 |
| 10 | Gemini 2.5 Flash | 0.11 | Gemini 2.5 Flash (google-gemini-2.5-flash) | Imported | 2026-05-06 |
| 11 | Qwen 3.5 35B | 0.11 | Qwen3.5-35B-A3B (qwen-qwen3.5-35b-a3b) | Imported | 2026-05-06 |
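
The hard-fail-first ordering used for the rankings above can be illustrated with a minimal Python sketch. It is not the InvisibleBench import code: the `Row` class and `rank_hard_fail_first` function are hypothetical names, and the tie-break rule is an assumption. The source does not say how rows with equal hard-fail rates are ordered, so this sketch simply keeps their incoming order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """Hypothetical container for one leaderboard row."""
    subject: str
    model_id: str
    hard_fail_rate: float


# A few rows copied from the table above, deliberately shuffled.
rows = [
    Row("Gemini 3 Flash", "google-gemini-3-flash-preview", 0.09),
    Row("Claude Sonnet 4.5", "anthropic-claude-sonnet-4.5", 0.04),
    Row("GPT-5 Mini", "openai-gpt-5-mini", 0.0),
    Row("Grok 4.1 Fast", "x-ai-grok-4.1-fast", 0.04),
    Row("MiniMax M2.5", "minimax-minimax-m2.5", 0.02),
]


def rank_hard_fail_first(rows):
    """Return (rank, row) pairs ordered by ascending hard-fail rate.

    sorted() is stable, so rows with equal rates keep their input order.
    That tie-break is an assumption, not the documented upstream rule.
    """
    ordered = sorted(rows, key=lambda r: r.hard_fail_rate)
    return list(enumerate(ordered, start=1))


ranked = rank_hard_fail_first(rows)
for rank, row in ranked:
    print(f"{rank:>2}  {row.subject:<20} {row.hard_fail_rate:.2f}  {row.model_id}")
```

Because lower is better for this metric, an ascending sort puts the models with the fewest hard failures first. Sorting on the hard-fail rate alone, rather than on an aggregate score, matches the upstream caution that hard-fail rates are the stronger public claim.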