InvisibleBench
InvisibleBench evaluates caregiver-support AI systems for relational harms, fail-closed safety and compliance gates, communication quality, coordination, and boundary integrity.
11 rows
hard_fail_rate (primary metric)
2026-05-06 (sampled)
Metadata
Metrics
Hard Fail Rate (lower is better), Hard Failures (lower is better), V3 Overall Score, Safety Gate Pass Rate, Compliance Gate Pass Rate, Blindspot Hits (lower is better), Unclear Mode Verdict Rate (lower is better), Scenarios
| Rank | Subject | Hard Fail Rate | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT-5 Mini | 0 | GPT-5 Mini (`openai-gpt-5-mini`) | Imported | 2026-05-06 |
| 2 | MiniMax M2.5 | 0.02 | MiniMax M2.5 (`minimax-minimax-m2.5`) | Imported | 2026-05-06 |
| 3 | Claude Sonnet 4.5 | 0.04 | Claude Sonnet 4.5 (`anthropic-claude-sonnet-4.5`) | Imported | 2026-05-06 |
| 4 | Grok 4.1 Fast | 0.04 | Grok 4.1 Fast (`x-ai-grok-4.1-fast`) | Imported | 2026-05-06 |
| 5 | GLM-5 | 0.05 | GLM 5 (`z-ai-glm-5`) | Imported | 2026-05-06 |
| 6 | Qwen 3.5 397B | 0.05 | Qwen3.5 397B A17B (`qwen-qwen3.5-397b-a17b`) | Imported | 2026-05-06 |
| 7 | Kimi K2.5 | 0.05 | MoonshotAI: Kimi K2.5 (`moonshotai-kimi-k2.5`) | Imported | 2026-05-06 |
| 8 | GPT-OSS 120B | 0.05 | gpt-oss-120b (`openai-gpt-oss-120b`) | Imported | 2026-05-06 |
| 9 | Gemini 3 Flash | 0.09 | Gemini 3 Flash Preview (`google-gemini-3-flash-preview`) | Imported | 2026-05-06 |
| 10 | Gemini 2.5 Flash | 0.11 | Gemini 2.5 Flash (`google-gemini-2.5-flash`) | Imported | 2026-05-06 |
| 11 | Qwen 3.5 35B | 0.11 | Qwen3.5-35B-A3B (`qwen-qwen3.5-35b-a3b`) | Imported | 2026-05-06 |
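
The ordering in the table is consistent with sorting subjects by the primary metric, `hard_fail_rate`, in ascending order (lower is better). The following is a minimal sketch of that ranking, not the benchmark's own code: the `Row` type and `rank_by_hard_fail_rate` function are hypothetical names introduced here, and the tie-breaking rule the leaderboard uses for equal rates (for example 0.04 or 0.05) is not stated in the source. The sketch simply keeps input order for ties, which Python's stable sort guarantees.

```python
from typing import NamedTuple


class Row(NamedTuple):
    """One leaderboard entry (hypothetical type, for illustration only)."""
    subject: str
    hard_fail_rate: float


# Values copied from the table above; they may be rounded for display.
ROWS = [
    Row("GPT-5 Mini", 0.0),
    Row("MiniMax M2.5", 0.02),
    Row("Claude Sonnet 4.5", 0.04),
    Row("Grok 4.1 Fast", 0.04),
    Row("GLM-5", 0.05),
    Row("Qwen 3.5 397B", 0.05),
    Row("Kimi K2.5", 0.05),
    Row("GPT-OSS 120B", 0.05),
    Row("Gemini 3 Flash", 0.09),
    Row("Gemini 2.5 Flash", 0.11),
    Row("Qwen 3.5 35B", 0.11),
]


def rank_by_hard_fail_rate(rows):
    """Return (rank, subject, rate) tuples, lowest hard fail rate first.

    Assumption: ties keep their input order. sorted() is stable, so rows
    with equal rates are not reordered relative to each other.
    """
    ordered = sorted(rows, key=lambda row: row.hard_fail_rate)
    return [
        (rank, row.subject, row.hard_fail_rate)
        for rank, row in enumerate(ordered, start=1)
    ]


if __name__ == "__main__":
    for rank, subject, rate in rank_by_hard_fail_rate(ROWS):
        print(f"{rank:>2}  {subject:<18} {rate:.2f}")
```

Because the displayed rates may be rounded, the published order among tied subjects could reflect unrounded values or a secondary metric; the sketch makes no claim about which.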