OrgForge-IT

Synthetic insider-threat detection benchmark built from OrgForge organizational simulation telemetry with triage, verdict, and false-positive scoring.

Rows: 10
Primary metric: verdict_f1
Sampled: 2026-05-28

Metadata

Metrics

Verdict F1, Triage F1, Baseline FP Rate (lower is better), Verdict Precision, Verdict Recall
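The F1, precision, and recall metrics listed above are related by the standard harmonic-mean formula. The sketch below shows that relationship only. The function name and the example counts are hypothetical and are not taken from the benchmark, and the benchmark's own scoring code may differ in how it handles edge cases.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from confusion counts.

    Assumption: a metric with an empty denominator is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, for illustration only: 2 true positives,
# 0 false positives, 1 false negative.
p, r, f1 = precision_recall_f1(tp=2, fp=0, fn=1)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

With these illustrative counts, precision is 1.000 and recall is 0.667, which gives an F1 of 0.800. This is one way a score of 0.800 can arise, not a statement about how any model in the leaderboard reached it.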

Latest Results

Rows are imported from the LaTeX source of a public arXiv paper. The table reports a 10-model insider-threat detection leaderboard over a 51-day OrgForge corpus.

| Rank | Subject | Verdict F1 | Model Match | Provenance | Sampled |
|------|---------|------------|-------------|------------|---------|
| 1 | Claude Opus 4.6 | 1.000 | Claude Opus 4.6 (anthropic-claude-opus-4.6) | Imported | 2026-05-28 |
| 2 | Devstral 2 123B | 1.000 | | Imported | 2026-05-28 |
| 3 | Claude Haiku 4.5 | 0.800 | Claude Haiku 4.5 (anthropic-claude-haiku-4.5) | Imported | 2026-05-28 |
| 4 | Claude Sonnet 4.6 | 0.800 | Claude Sonnet 4.6 (anthropic-claude-sonnet-4.6) | Imported | 2026-05-28 |
| 5 | DeepSeek v3.2 | 0.800 | DeepSeek V3.2 (deepseek-deepseek-v3.2) | Imported | 2026-05-28 |
| 6 | GLM-5 | 0.800 | GLM GLM 5 (z-ai-glm-5) | Imported | 2026-05-28 |
| 7 | Mistral Large 675B | 0.800 | | Imported | 2026-05-28 |
| 8 | Qwen3-Coder | 0.800 | Qwen3 Coder 480B A35B (qwen-qwen3-coder) | Imported | 2026-05-28 |
| 9 | Llama 4 Maverick | 0.800 | Llama 4 Maverick (meta-llama-4-maverick) | Imported | 2026-05-28 |
| 10 | Llama 3.3 70B | 0.800 | Llama 3.3 70B Instruct (meta-llama-llama-3.3-70b-instruct) | Imported | 2026-05-28 |