OrgForge-IT
Synthetic insider-threat detection benchmark built from OrgForge organizational simulation telemetry with triage, verdict, and false-positive scoring.
10 rows
Primary metric: verdict_f1
Sampled: 2026-05-28
Metadata
Metrics
Verdict F1, Triage F1, Baseline FP Rate (lower is better), Verdict Precision, Verdict Recall
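As a rough illustration of how the metrics listed above relate to one another, the sketch below computes precision, recall, F1, and a false-positive rate from confusion counts. It assumes the standard textbook definitions; this page does not state how OrgForge-IT itself defines or aggregates them. The `Counts` class, the function names, and the example counts are hypothetical and are not benchmark data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Hypothetical confusion counts for one model's verdicts."""
    tp: int      # threats correctly flagged
    fp: int      # benign subjects wrongly flagged
    fn: int      # threats missed
    tn: int = 0  # benign subjects correctly cleared


def precision(c: Counts) -> float:
    flagged = c.tp + c.fp
    return c.tp / flagged if flagged else 0.0


def recall(c: Counts) -> float:
    actual = c.tp + c.fn
    return c.tp / actual if actual else 0.0


def f1(c: Counts) -> float:
    # Harmonic mean of precision and recall (standard definition, assumed here).
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def false_positive_rate(c: Counts) -> float:
    # Share of benign subjects that were flagged; lower is better.
    benign = c.fp + c.tn
    return c.fp / benign if benign else 0.0


# Illustrative counts only, not taken from the leaderboard.
example = Counts(tp=2, fp=1, fn=0, tn=7)
print(f"precision={precision(example):.3f}")
print(f"recall={recall(example):.3f}")
print(f"f1={f1(example):.3f}")
print(f"fp_rate={false_positive_rate(example):.3f}")
```

Returning 0.0 for empty denominators is one common convention; a scorer could equally treat those cases as undefined.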
| Rank | Subject | Verdict F1 | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 | 1.000 | Claude Opus 4.6 (`anthropic-claude-opus-4.6`) | Imported | 2026-05-28 |
| 2 | Devstral 2 123B | 1.000 | — | Imported | 2026-05-28 |
| 3 | Claude Haiku 4.5 | 0.800 | Claude Haiku 4.5 (`anthropic-claude-haiku-4.5`) | Imported | 2026-05-28 |
| 4 | Claude Sonnet 4.6 | 0.800 | Claude Sonnet 4.6 (`anthropic-claude-sonnet-4.6`) | Imported | 2026-05-28 |
| 5 | DeepSeek v3.2 | 0.800 | DeepSeek V3.2 (`deepseek-deepseek-v3.2`) | Imported | 2026-05-28 |
| 6 | GLM-5 | 0.800 | GLM 5 (`z-ai-glm-5`) | Imported | 2026-05-28 |
| 7 | Mistral Large 675B | 0.800 | — | Imported | 2026-05-28 |
| 8 | Qwen3-Coder | 0.800 | Qwen3 Coder 480B A35B (`qwen-qwen3-coder`) | Imported | 2026-05-28 |
| 9 | Llama 4 Maverick | 0.800 | Llama 4 Maverick (`meta-llama-4-maverick`) | Imported | 2026-05-28 |
| 10 | Llama 3.3 70B | 0.800 | Llama 3.3 70B Instruct (`meta-llama-llama-3.3-70b-instruct`) | Imported | 2026-05-28 |