RedSage-Bench
Cybersecurity benchmark with 30K multiple-choice and 240 open-ended QA items covering knowledge, offensive skills, and tool expertise.
Rows: 17
Primary metric: macro_accuracy
Sampled: 2026-05-28
Metadata
Metrics: Macro Accuracy, Knowledge General Accuracy, Knowledge Frameworks Accuracy, Offensive Skills Accuracy, Command-Line Tools Accuracy, Kali Tools Accuracy
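The page does not say how Macro Accuracy is aggregated. A minimal sketch, assuming it is the unweighted mean of the five per-category accuracies listed above; the function name, the dictionary keys, and the example scores are all hypothetical and are not taken from the leaderboard:

```python
from statistics import mean


def macro_accuracy(per_category: dict[str, float]) -> float:
    """Return the unweighted mean of per-category accuracies (in percent).

    Assumption: every category counts equally, regardless of how many
    benchmark items it contains.
    """
    if not per_category:
        raise ValueError("need at least one category")
    return mean(per_category.values())


# Hypothetical per-category scores, for illustration only.
example_scores = {
    "knowledge_general": 80.0,
    "knowledge_frameworks": 90.0,
    "offensive_skills": 70.0,
    "command_line_tools": 85.0,
    "kali_tools": 75.0,
}

example_macro = macro_accuracy(example_scores)
print(f"{example_macro:.2f}%")
```

Under this assumption a small category moves the headline number as much as a large one, which is the usual reason to report a macro average alongside per-category results.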
| Rank | Subject | Macro Accuracy | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT-5 | 88.68% | GPT-5 (openai-gpt-5) | Imported | 2026-05-28 |
| 2 | RedSage-8B-Ins | 85.73% | — | Imported | 2026-05-28 |
| 3 | Qwen3-32B | 85.40% | Qwen3 32B (qwen-qwen3-32b) | Imported | 2026-05-28 |
| 4 | RedSage-8B-Seed | 85.21% | — | Imported | 2026-05-28 |
| 5 | RedSage-8B-Base | 85.05% | — | Imported | 2026-05-28 |
| 6 | RedSage-8B-CFW | 84.86% | — | Imported | 2026-05-28 |
| 7 | RedSage-8B-DPO | 84.83% | — | Imported | 2026-05-28 |
| 8 | Qwen3-8B-Base | 84.24% | — | Imported | 2026-05-28 |
| 9 | Qwen3-8B | 81.85% | Qwen3 8B (qwen-qwen3-8b) | Imported | 2026-05-28 |
| 10 | DeepHat-V1-7B | 80.18% | — | Imported | 2026-05-28 |
| 11 | Foundation-Sec-8B | 78.51% | — | Imported | 2026-05-28 |
| 12 | Llama-3.1-8B | 78.02% | — | Imported | 2026-05-28 |
| 13 | Llama-3.1-8B-Instruct | 77.05% | Llama 3.1 8B Instruct (meta-llama-llama-3.1-8b-instruct) | Imported | 2026-05-28 |
| 14 | Llama-Primus-Base | 77.02% | — | Imported | 2026-05-28 |
| 15 | Foundation-Sec-8B-Instruct | 76.12% | — | Imported | 2026-05-28 |
| 16 | Llama-Primus-Merged | 74.81% | — | Imported | 2026-05-28 |
| 17 | Lily-Cybersecurity-7B-v0.2 | 71.19% | — | Imported | 2026-05-28 |