RedSage-Bench

Cybersecurity benchmark with 30K multiple-choice and 240 open-ended QA items covering knowledge, offensive skills, and tool expertise.

17 rows
Primary metric: macro_accuracy
Sampled: 2026-05-28

Metadata

Metrics

Macro Accuracy, Knowledge General Accuracy, Knowledge Frameworks Accuracy, Offensive Skills Accuracy, Command-Line Tools Accuracy, Kali Tools Accuracy
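The page does not state how Macro Accuracy is derived from the other metrics. The sketch below assumes the usual macro-averaging convention: an unweighted mean of the five per-category accuracies listed above. The function name, the dictionary keys, and the numbers are hypothetical and invented for illustration; they are not results from the benchmark.

```python
from statistics import mean


def macro_accuracy(per_category: dict[str, float]) -> float:
    """Unweighted mean of per-category accuracies, each given in [0, 1].

    Assumption: every category counts equally, regardless of how many
    questions it contains.
    """
    if not per_category:
        raise ValueError("need at least one category")
    return mean(list(per_category.values()))


# Hypothetical per-category accuracies, made up for this example only.
example = {
    "knowledge_general": 0.80,
    "knowledge_frameworks": 0.70,
    "offensive_skills": 0.90,
    "command_line_tools": 0.60,
    "kali_tools": 0.50,
}

print(f"Macro accuracy: {macro_accuracy(example):.2%}")
```

Under this convention a small category moves the headline number as much as a large one, which is the usual reason to report a macro average next to the per-category figures.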

Latest Results

Rows are imported from the RedSage-MCQ results in the public arXiv LaTeX source. The table reports 0-shot multiple-choice (MCQ) accuracy across the cybersecurity knowledge, skills, and tool categories.

Rank  Subject                     Macro Accuracy  Model Match                                               Provenance  Sampled
1     GPT-5                       88.68%          GPT-5 (openai-gpt-5)                                      Imported    2026-05-28
2     RedSage-8B-Ins              85.73%          -                                                         Imported    2026-05-28
3     Qwen3-32B                   85.4%           Qwen3 32B (qwen-qwen3-32b)                                Imported    2026-05-28
4     RedSage-8B-Seed             85.21%          -                                                         Imported    2026-05-28
5     RedSage-8B-Base             85.05%          -                                                         Imported    2026-05-28
6     RedSage-8B-CFW              84.86%          -                                                         Imported    2026-05-28
7     RedSage-8B-DPO              84.83%          -                                                         Imported    2026-05-28
8     Qwen3-8B-Base               84.24%          -                                                         Imported    2026-05-28
9     Qwen3-8B                    81.85%          Qwen3 8B (qwen-qwen3-8b)                                  Imported    2026-05-28
10    DeepHat-V1-7B               80.18%          -                                                         Imported    2026-05-28
11    Foundation-Sec-8B           78.51%          -                                                         Imported    2026-05-28
12    Llama-3.1-8B                78.02%          -                                                         Imported    2026-05-28
13    Llama-3.1-8B-Instruct       77.05%          Llama 3.1 8B Instruct (meta-llama-llama-3.1-8b-instruct)  Imported    2026-05-28
14    Llama-Primus-Base           77.02%          -                                                         Imported    2026-05-28
15    Foundation-Sec-8B-Instruct  76.12%          -                                                         Imported    2026-05-28
16    Llama-Primus-Merged         74.81%          -                                                         Imported    2026-05-28
17    Lily-Cybersecurity-7B-v0.2  71.19%          -                                                         Imported    2026-05-28