PandaBench

A comprehensive LLM safety benchmark covering jailbreak attacks, defense mechanisms, judges, and safety-capability tradeoffs. It aggregates attack success rates and AlpacaEval capability scores by model and defense method.

490 rows
Primary metric: robustness_score
Sampled: 2026-05-06

Metadata

Metrics

Robustness Score, Mean Attack Success Rate (lower is better), GCG Attack Success Rate (lower is better), PAIR GPT-4o Judge ASR (lower is better), PAIR Qwen Judge ASR (lower is better), PAIR Llama Judge ASR (lower is better), AlpacaEval Win Rate, AlpacaEval LC Win Rate, Aggregated Rows

Latest Results

Rows are aggregated by model_name and defense_method from PandaBench's public CSV. Attack success rates are lower-is-better; robustness_score is 100 minus mean attack success rate.

Rank Subject Robustness Score Model Match Provenance Sampled
1 Claude-3-5-sonnet + ICL 98.25 Imported 2026-05-06
2 Claude-3-5-sonnet + SelfReminder 98.09 Imported 2026-05-06
3 Claude-3-5-haiku + SelfReminder 97.99 Imported 2026-05-06
4 Claude-3-5-sonnet + GoalPriority 97.70 Imported 2026-05-06
5 Claude-3-5-sonnet + SelfDefense 97.65 Imported 2026-05-06
6 Qwen3-30B-A3B + SelfDefense 96.68 Imported 2026-05-06
7 GPT-4o-11-20 + SelfReminder 96.67 Imported 2026-05-06
8 Claude-3-5-sonnet + SmoothLLM 96.39 Imported 2026-05-06
9 Claude-3-5-haiku + GoalPriority 96.35 Imported 2026-05-06
10 Claude-3-5-haiku + ICL 96.24 Imported 2026-05-06
11 Qwen3-14B + SelfDefense 96.18 Imported 2026-05-06
12 GPT-4o-11-20 + ICL 96.11 Imported 2026-05-06
13 Claude-3-5-haiku + SelfDefense 96.10 Imported 2026-05-06
14 Llama-3.2-3B + SelfDefense 96.00 Imported 2026-05-06
15 Claude-3-7-sonnet + SelfDefense 95.94 Imported 2026-05-06
16 o3-mini + SelfDefense 95.93 Imported 2026-05-06
17 Qwen3-8B + SelfDefense 95.80 Imported 2026-05-06
18 Doubao-pro + SelfDefense 95.77 Imported 2026-05-06
19 Qwen3-30B-A3B + GoalPriority 95.73 Imported 2026-05-06
20 Claude-3-5-sonnet + Baseline 95.73 Imported 2026-05-06
21 Qwen3-32B + SelfDefense 95.68 Imported 2026-05-06
22 GPT-4o-11-20 + SelfDefense 95.66 Imported 2026-05-06
23 Claude-3-5-sonnet + PerplexityFilter 95.56 Imported 2026-05-06
24 Qwen2-72B + SelfDefense 95.49 Imported 2026-05-06
25 Llama-3.1-Tulu-3-70B + SelfDefense 95.44 Imported 2026-05-06
26 GPT-4o-11-20 + GoalPriority 95.35 Imported 2026-05-06
27 Doubao-lite + SelfDefense 95.23 Imported 2026-05-06
28 Claude-3-7-sonnet + SelfReminder 95.20 Imported 2026-05-06
29 Claude-3-5-haiku + RPO 95.16 Imported 2026-05-06
30 Doubao-pro + GoalPriority 95.14 Imported 2026-05-06
31 Qwen3-14B + GoalPriority 95.13 Imported 2026-05-06
32 Claude-3-5-sonnet + RPO 95.02 Imported 2026-05-06
33 Doubao-pro + SelfReminder 95.01 Imported 2026-05-06
34 Qwen3-30B-A3B + SelfReminder 94.99 Imported 2026-05-06
35 DS-Llama-70b + SelfDefense 94.91 Imported 2026-05-06
36 Qwen3-8B + GoalPriority 94.85 Imported 2026-05-06
37 Claude-3-7-sonnet + GoalPriority 94.76 Imported 2026-05-06
38 Doubao-1.5-pro + SelfReminder 94.72 Imported 2026-05-06
39 Doubao-lite + GoalPriority 94.67 Imported 2026-05-06
40 Llama-3.2-1B + SelfDefense 94.66 Imported 2026-05-06
41 GPT-4o-08-06 + SelfDefense 94.43 Imported 2026-05-06
42 o3-mini + GoalPriority 94.31 Imported 2026-05-06
43 Claude-3-7-sonnet + RPO 94.30 Imported 2026-05-06
44 Doubao-lite + SelfReminder 94.26 Imported 2026-05-06
45 Qwen3-8B + SelfReminder 94.05 Imported 2026-05-06
46 Claude-3-7-sonnet + ICL 93.94 Imported 2026-05-06
47 Qwen2-72B + SelfReminder 93.90 Imported 2026-05-06
48 o3-mini + SelfReminder 93.89 Imported 2026-05-06
49 Llama-3.1-Tulu-3-8B + SelfDefense 93.85 Imported 2026-05-06
50 Qwen3-4B + SelfDefense 93.81 Imported 2026-05-06
51 DS-Llama-70b + GoalPriority 93.75 Imported 2026-05-06
52 GPT-4o-mini + SelfDefense 93.70 Imported 2026-05-06
53 Qwen3-32B + GoalPriority 93.69 Imported 2026-05-06
54 Qwen3-14B + SelfReminder 93.66 Imported 2026-05-06
55 Claude-3-5-haiku + SmoothLLM 93.63 Imported 2026-05-06
56 GPT-4o-08-06 + SelfReminder 93.61 Imported 2026-05-06
57 Doubao-lite + ICL 93.57 Imported 2026-05-06
58 Llama-3.2-1B + GoalPriority 93.57 Imported 2026-05-06
59 Qwen3-32B + SelfReminder 93.53 Imported 2026-05-06
60 Gemini-2.0-flash + GoalPriority 93.47 Imported 2026-05-06
61 o3-mini + ICL 93.44 Imported 2026-05-06
62 Doubao-1.5-pro + SelfDefense 93.34 Imported 2026-05-06
63 Llama-3.2-3B + GoalPriority 93.32 Imported 2026-05-06
64 Doubao-1.5-pro + GoalPriority 93.26 Imported 2026-05-06
65 Claude-3-5-haiku + Baseline 93.23 Imported 2026-05-06
66 Claude-3-5-haiku + PerplexityFilter 93.16 Imported 2026-05-06
67 Llama-3.2-3B + SelfReminder 93.09 Imported 2026-05-06
68 Qwen2-72B + GoalPriority 92.95 Imported 2026-05-06
69 Llama-3.1-Tulu-3-70B + GoalPriority 92.92 Imported 2026-05-06
70 Gemini-2.0-flash + SelfDefense 92.84 Imported 2026-05-06
71 GPT-4o-mini + GoalPriority 92.83 Imported 2026-05-06
72 Gemma-2-2b-it + SelfDefense 92.80 Imported 2026-05-06
73 GPT-4o-08-06 + GoalPriority 92.75 Imported 2026-05-06
74 Doubao-1.5-lite + SelfDefense 92.69 Imported 2026-05-06
75 DS-v3-0324 + SelfReminder 92.60 Imported 2026-05-06
76 Qwen3-1.7B + SelfDefense 92.48 Imported 2026-05-06
77 Llama-3.1-Tulu-3-70B + SelfReminder 92.39 Imported 2026-05-06
78 Doubao-1.5-lite + SelfReminder 92.28 Imported 2026-05-06
79 Llama-3-1-405B + SelfDefense 92.26 Imported 2026-05-06
80 Qwen2.5-14B + SelfDefense 92.23 Imported 2026-05-06
81 Doubao-pro + ICL 92.17 Imported 2026-05-06
82 Llama-3.2-3B + RPO 92.15 Imported 2026-05-06
83 Qwen2.5-72B + SelfDefense 92.08 Imported 2026-05-06
84 Qwen2.5-32B + SelfDefense 92.08 Imported 2026-05-06
85 Llama-3-1-405B + SelfReminder 92.07 Imported 2026-05-06
86 Kimi-latest + SelfDefense 92.02 Imported 2026-05-06
87 o3-mini + PerplexityFilter 91.97 Imported 2026-05-06
88 Llama-3.1-8B + SelfDefense 91.95 Imported 2026-05-06
89 o3-mini + RPO 91.94 Imported 2026-05-06
90 DS-Llama-70b + ICL 91.84 Imported 2026-05-06
91 Claude-3-7-sonnet + SmoothLLM 91.78 Imported 2026-05-06
92 GPT-4o-mini + SelfReminder 91.77 Imported 2026-05-06
93 Gemini-2.0-flash + SelfReminder 91.77 Imported 2026-05-06
94 DS-v3-0324 + SelfDefense 91.72 Imported 2026-05-06
95 Phi-3-mini + SelfDefense 91.69 Imported 2026-05-06
96 Doubao-pro + PerplexityFilter 91.68 Imported 2026-05-06
97 Qwen2-7B + SelfDefense 91.68 Imported 2026-05-06
98 Llama-3.2-1B + RPO 91.63 Imported 2026-05-06
99 o3-mini + Baseline 91.60 Imported 2026-05-06
100 Phi-3-5-MoE + SelfDefense 91.59 Imported 2026-05-06
101 Qwen3-4B + GoalPriority 91.52 Imported 2026-05-06
102 GPT-4o-11-20 + RPO 91.52 Imported 2026-05-06
103 Doubao-lite + SmoothLLM 91.39 Imported 2026-05-06
104 GLM-4-plus + SelfDefense 91.38 Imported 2026-05-06
105 Doubao-pro + Baseline 91.38 Imported 2026-05-06
106 Doubao-lite + PerplexityFilter 91.34 Imported 2026-05-06
107 Llama-3.1-70B + SelfDefense 91.33 Imported 2026-05-06
108 Qwen3-30B-A3B + ICL 91.31 Imported 2026-05-06
109 GPT-4o-11-20 + PerplexityFilter 91.19 Imported 2026-05-06
110 Llama-3.2-3B + ICL 91.09 Imported 2026-05-06
111 Doubao-1.5-lite + GoalPriority 91.08 Imported 2026-05-06
112 GPT-4o-mini + ICL 91.06 Imported 2026-05-06
113 Llama-3.2-1B + SelfReminder 91.06 Imported 2026-05-06
114 Llama-3.2-3B + PerplexityFilter 91.02 Imported 2026-05-06
115 Doubao-lite + Baseline 91.02 Imported 2026-05-06
116 Llama-3-1-405B + GoalPriority 90.95 Imported 2026-05-06
117 GPT-4o-08-06 + ICL 90.93 Imported 2026-05-06
118 GPT-4o-11-20 + Baseline 90.91 Imported 2026-05-06
119 Qwen3-4B + SelfReminder 90.90 Imported 2026-05-06
120 Llama-3.1-70B + GoalPriority 90.76 Imported 2026-05-06
121 Qwen2-72B + ICL 90.74 Imported 2026-05-06
122 Doubao-pro + RPO 90.55 Imported 2026-05-06
123 Qwen2.5-7B + SelfDefense 90.51 Imported 2026-05-06
124 GPT-4o-08-06 + RPO 90.47 Imported 2026-05-06
125 DS-v3-0324 + GoalPriority 90.33 Imported 2026-05-06
126 o3-mini + SmoothLLM 90.30 Imported 2026-05-06
127 Llama-3.3-70B + SelfDefense 90.22 Imported 2026-05-06
128 Claude-3-7-sonnet + Baseline 90.17 Imported 2026-05-06
129 Qwen2.5-1.5B + SelfDefense 90.15 Imported 2026-05-06
130 Qwen2-72B + SmoothLLM 90.15 Imported 2026-05-06
131 Doubao-1.5-lite + ICL 90.14 Imported 2026-05-06
132 Llama-3.1-Tulu-3-70B + ICL 89.97 Imported 2026-05-06
133 Gemini-2.0-flash-lite + SelfDefense 89.95 Imported 2026-05-06
134 DS-r1 + SelfDefense 89.85 Imported 2026-05-06
135 Qwen2.5-3B + SelfDefense 89.82 Imported 2026-05-06
136 Claude-3-7-sonnet + PerplexityFilter 89.78 Imported 2026-05-06
137 Phi-3-5-MoE + ICL 89.77 Imported 2026-05-06
138 GPT-4o-11-20 + SmoothLLM 89.72 Imported 2026-05-06
139 Kimi-latest + SelfReminder 89.66 Imported 2026-05-06
140 Qwen3-14B + ICL 89.65 Imported 2026-05-06
141 Kimi-latest + GoalPriority 89.64 Imported 2026-05-06
142 Qwen3-8B + ICL 89.52 Imported 2026-05-06
143 Gemini-2.0-pro + SelfReminder 89.51 Imported 2026-05-06
144 Llama-3.1-70B + SelfReminder 89.47 Imported 2026-05-06
145 GPT-4o-11-20 + Semantic SmoothLLM 89.35 Imported 2026-05-06
146 Doubao-1.5-pro + ICL 89.34 Imported 2026-05-06
147 Doubao-pro + SmoothLLM 89.27 Imported 2026-05-06
148 Llama-3.2-3B + SmoothLLM 89.24 Imported 2026-05-06
149 DS-v3 + SelfDefense 89.23 Imported 2026-05-06
150 GLM-4-flash + SelfDefense 89.20 Imported 2026-05-06
151 Claude-3-5-haiku + Paraphrase 89.17 Imported 2026-05-06
152 GPT-4o-mini + RPO 89.13 Imported 2026-05-06
153 Qwen3-30B-A3B + Paraphrase 88.95 Imported 2026-05-06
154 Kimi-latest + PerplexityFilter 88.94 Imported 2026-05-06
155 Qwen2-72B + Baseline 88.92 Imported 2026-05-06
156 Gemini-2.0-pro + SelfDefense 88.89 Imported 2026-05-06
157 Llama-3.1-8B + SelfReminder 88.74 Imported 2026-05-06
158 Llama-3.1-Tulu-3-70B + RPO 88.73 Imported 2026-05-06
159 Phi-3-5-MoE + Baseline 88.72 Imported 2026-05-06
160 GLM-4-plus + SelfReminder 88.69 Imported 2026-05-06
161 Llama-3-1-405B + ICL 88.67 Imported 2026-05-06
162 Llama-3.2-1B + ICL 88.67 Imported 2026-05-06
163 Qwen2-72B + PerplexityFilter 88.57 Imported 2026-05-06
164 Phi-3-5-MoE + GoalPriority 88.55 Imported 2026-05-06
165 Gemini-2.0-pro + GoalPriority 88.55 Imported 2026-05-06
166 Qwen2.5-72B + GoalPriority 88.54 Imported 2026-05-06
167 Qwen2.5-72B + SelfReminder 88.50 Imported 2026-05-06
168 Qwen2-72B + RPO 88.47 Imported 2026-05-06
169 GPT-4o-mini + PerplexityFilter 88.47 Imported 2026-05-06
170 Qwen2.5-0.5B + SelfDefense 88.42 Imported 2026-05-06
171 Phi-3-mini + SelfReminder 88.38 Imported 2026-05-06
172 GPT-4o-mini + SmoothLLM 88.37 Imported 2026-05-06
173 Llama-3.1-Tulu-3-8B + GoalPriority 88.31 Imported 2026-05-06
174 Qwen3-32B + ICL 88.23 Imported 2026-05-06
175 Qwen3-0.6B + SelfDefense 88.23 Imported 2026-05-06
176 Phi-3-mini + GoalPriority 88.21 Imported 2026-05-06
177 Qwen3-8B + Paraphrase 88.19 Imported 2026-05-06
178 Llama-3.2-1B + SmoothLLM 88.14 Imported 2026-05-06
179 GLM-4-plus + GoalPriority 88.12 Imported 2026-05-06
180 Doubao-lite + RPO 88.08 Imported 2026-05-06
181 GPT-4o-08-06 + SmoothLLM 88.05 Imported 2026-05-06
182 Llama-3.1-8B + GoalPriority 88.02 Imported 2026-05-06
183 GPT-4o-08-06 + PerplexityFilter 87.88 Imported 2026-05-06
184 Qwen3-14B + Paraphrase 87.69 Imported 2026-05-06
185 Llama-3.2-1B + PerplexityFilter 87.64 Imported 2026-05-06
186 Doubao-lite + Paraphrase 87.64 Imported 2026-05-06
187 GPT-4o-mini + Semantic SmoothLLM 87.58 Imported 2026-05-06
188 Phi-3-5-MoE + Semantic SmoothLLM 87.53 Imported 2026-05-06
189 GPT-4o-mini + Baseline 87.52 Imported 2026-05-06
190 Llama-3.1-Tulu-3-8B + SelfReminder 87.39 Imported 2026-05-06
191 GPT-4o-08-06 + Baseline 87.38 Imported 2026-05-06
192 Qwen3-4B + Paraphrase 87.27 Imported 2026-05-06
193 Doubao-1.5-pro + Baseline 87.24 Imported 2026-05-06
194 Phi-3-5-MoE + PerplexityFilter 87.22 Imported 2026-05-06
195 GPT-4o-08-06 + Semantic SmoothLLM 87.14 Imported 2026-05-06
196 Llama-3.1-Tulu-3-70B + Baseline 87.14 Imported 2026-05-06
197 DS-2-1212 + SelfDefense 87.14 Imported 2026-05-06
198 Claude-3-5-sonnet + Paraphrase 87.13 Imported 2026-05-06
199 DS-v3-0324 + ICL 87.11 Imported 2026-05-06
200 Phi-3-5-MoE + SelfReminder 87.10 Imported 2026-05-06
201 Phi-3-5-MoE + RPO 87.07 Imported 2026-05-06
202 Doubao-1.5-pro + PerplexityFilter 87.06 Imported 2026-05-06
203 Llama-3.1-Tulu-3-70B + PerplexityFilter 87.03 Imported 2026-05-06
204 Doubao-1.5-lite + PerplexityFilter 87.00 Imported 2026-05-06
205 Qwen2.5-72B + ICL 86.95 Imported 2026-05-06
206 Qwen3-32B + Paraphrase 86.94 Imported 2026-05-06
207 Qwen3-30B-A3B + SmoothLLM 86.92 Imported 2026-05-06
208 Llama-3.2-3B + Baseline 86.90 Imported 2026-05-06
209 Gemini-2.0-flash-lite + GoalPriority 86.83 Imported 2026-05-06
210 Qwen2.5-14B + GoalPriority 86.67 Imported 2026-05-06
211 Doubao-1.5-lite + Baseline 86.60 Imported 2026-05-06
212 Qwen2.5-72B + SmoothLLM 86.40 Imported 2026-05-06
213 Doubao-1.5-pro + RPO 86.36 Imported 2026-05-06
214 Phi-3-mini + ICL 86.36 Imported 2026-05-06
215 Qwen3-30B-A3B + RPO 86.30 Imported 2026-05-06
216 Phi-3-5-MoE + SmoothLLM 86.19 Imported 2026-05-06
217 Llama-3.1-Tulu-3-70B + Paraphrase 86.03 Imported 2026-05-06
218 DS-v3 + GoalPriority 86.02 Imported 2026-05-06
219 Qwen3-8B + Semantic SmoothLLM 85.99 Imported 2026-05-06
220 Llama-3.2-1B + Baseline 85.81 Imported 2026-05-06
221 Doubao-1.5-lite + RPO 85.80 Imported 2026-05-06
222 DS-Llama-70b + SelfReminder 85.72 Imported 2026-05-06
223 Qwen2.5-14B + SelfReminder 85.67 Imported 2026-05-06
224 Gemma-2-2b-it + SmoothLLM 85.52 Imported 2026-05-06
225 Qwen2.5-32B + SelfReminder 85.51 Imported 2026-05-06
226 Qwen3-1.7B + Paraphrase 85.48 Imported 2026-05-06
227 Qwen2.5-72B + PerplexityFilter 85.48 Imported 2026-05-06
228 Llama-3.1-Tulu-3-8B + RPO 85.43 Imported 2026-05-06
229 Qwen2.5-32B + GoalPriority 85.42 Imported 2026-05-06
230 Qwen3-30B-A3B + Semantic SmoothLLM 85.20 Imported 2026-05-06
231 Qwen2.5-72B + Baseline 85.18 Imported 2026-05-06
232 Qwen2.5-72B + RPO 85.17 Imported 2026-05-06
233 Qwen2.5-14B + SmoothLLM 85.10 Imported 2026-05-06
234 Phi-3-mini + Baseline 85.04 Imported 2026-05-06
235 Phi-3-mini + Paraphrase 85.02 Imported 2026-05-06
236 Llama-3.1-Tulu-3-70B + SmoothLLM 85.00 Imported 2026-05-06
237 Qwen3-4B + ICL 84.92 Imported 2026-05-06
238 Claude-3-7-sonnet + Paraphrase 84.90 Imported 2026-05-06
239 Kimi-latest + ICL 84.82 Imported 2026-05-06
240 Phi-3-mini + SmoothLLM 84.77 Imported 2026-05-06
241 Llama-3.1-70B + RPO 84.74 Imported 2026-05-06
242 Phi-3-mini + RPO 84.61 Imported 2026-05-06
243 Qwen2.5-14B + RPO 84.58 Imported 2026-05-06
244 Qwen3-32B + SmoothLLM 84.52 Imported 2026-05-06
245 Doubao-1.5-pro + SmoothLLM 84.52 Imported 2026-05-06
246 Qwen3-30B-A3B + Baseline 84.44 Imported 2026-05-06
247 Phi-3-mini + PerplexityFilter 84.44 Imported 2026-05-06
248 Qwen3-30B-A3B + PerplexityFilter 84.39 Imported 2026-05-06
249 Llama-3.1-Tulu-3-8B + Semantic SmoothLLM 84.24 Imported 2026-05-06
250 Qwen2.5-1.5B + SelfReminder 84.23 Imported 2026-05-06
251 Phi-3-mini + Semantic SmoothLLM 84.23 Imported 2026-05-06
252 Qwen2.5-14B + PerplexityFilter 84.21 Imported 2026-05-06
253 Qwen2.5-14B + ICL 84.20 Imported 2026-05-06
254 Llama-3-1-405B + PerplexityFilter 84.17 Imported 2026-05-06
255 Llama-3.1-Tulu-3-8B + ICL 84.17 Imported 2026-05-06
256 Qwen3-14B + SmoothLLM 84.17 Imported 2026-05-06
257 Llama-3.1-Tulu-3-8B + Paraphrase 84.16 Imported 2026-05-06
258 GLM-4-plus + ICL 84.15 Imported 2026-05-06
259 Qwen2-72B + Semantic SmoothLLM 84.14 Imported 2026-05-06
260 Llama-3.1-70B + ICL 84.10 Imported 2026-05-06
261 GLM-4-plus + RPO 84.02 Imported 2026-05-06
262 DS-v3 + SelfReminder 83.98 Imported 2026-05-06
263 Qwen2.5-1.5B + Paraphrase 83.88 Imported 2026-05-06
264 Qwen3-14B + PerplexityFilter 83.86 Imported 2026-05-06
265 Gemini-2.0-flash-lite + SelfReminder 83.81 Imported 2026-05-06
266 DS-Llama-70b + PerplexityFilter 83.78 Imported 2026-05-06
267 Qwen2.5-14B + Baseline 83.77 Imported 2026-05-06
268 Gemma-2-2b-it + GoalPriority 83.74 Imported 2026-05-06
269 DS-Llama-70b + Paraphrase 83.67 Imported 2026-05-06
270 Llama-3.1-Tulu-3-8B + PerplexityFilter 83.67 Imported 2026-05-06
271 DS-r1 + GoalPriority 83.65 Imported 2026-05-06
272 Qwen3-14B + RPO 83.59 Imported 2026-05-06
273 Qwen2-7B + GoalPriority 83.50 Imported 2026-05-06
274 Qwen2.5-72B + Semantic SmoothLLM 83.48 Imported 2026-05-06
275 Qwen2.5-1.5B + RPO 83.45 Imported 2026-05-06
276 Qwen2.5-32B + SmoothLLM 83.41 Imported 2026-05-06
277 Claude-3-5-haiku + Semantic SmoothLLM 83.35 Imported 2026-05-06
278 Gemma-2-2b-it + SelfReminder 83.31 Imported 2026-05-06
279 GLM-4-plus + PerplexityFilter 83.30 Imported 2026-05-06
280 Kimi-latest + Baseline 83.30 Imported 2026-05-06
281 Kimi-latest + SmoothLLM 83.22 Imported 2026-05-06
282 Qwen2.5-32B + ICL 83.20 Imported 2026-05-06
283 Qwen3-32B + RPO 83.18 Imported 2026-05-06
284 Claude-3-5-sonnet + Semantic SmoothLLM 83.14 Imported 2026-05-06
285 Qwen2.5-1.5B + SmoothLLM 83.09 Imported 2026-05-06
286 DS-Llama-70b + RPO 83.05 Imported 2026-05-06
287 Qwen2.5-1.5B + GoalPriority 83.03 Imported 2026-05-06
288 Qwen2.5-1.5B + ICL 83.03 Imported 2026-05-06
289 Llama-3.1-Tulu-3-8B + Baseline 83.00 Imported 2026-05-06
290 Kimi-latest + RPO 82.99 Imported 2026-05-06
291 Llama-3.1-Tulu-3-8B + SmoothLLM 82.97 Imported 2026-05-06
292 Doubao-1.5-lite + SmoothLLM 82.95 Imported 2026-05-06
293 DS-v3-0324 + RPO 82.93 Imported 2026-05-06
294 Qwen2.5-1.5B + Semantic SmoothLLM 82.88 Imported 2026-05-06
295 Llama-3.3-70B + ICL 82.86 Imported 2026-05-06
296 o3-mini + Paraphrase 82.85 Imported 2026-05-06
297 GLM-4-plus + Baseline 82.80 Imported 2026-05-06
298 Gemma-2-2b-it + RPO 82.76 Imported 2026-05-06
299 Llama-3-1-405B + SmoothLLM 82.69 Imported 2026-05-06
300 Qwen2.5-0.5B + ICL 82.63 Imported 2026-05-06
301 GLM-4-flash + GoalPriority 82.61 Imported 2026-05-06
302 Qwen3-0.6B + Paraphrase 82.56 Imported 2026-05-06
303 Llama-3-1-405B + RPO 82.49 Imported 2026-05-06
304 Llama-3.1-70B + SmoothLLM 82.48 Imported 2026-05-06
305 Llama-3.1-8B + ICL 82.42 Imported 2026-05-06
306 Qwen3-8B + SmoothLLM 82.34 Imported 2026-05-06
307 Qwen2.5-32B + RPO 82.32 Imported 2026-05-06
308 GPT-4o-11-20 + Paraphrase 82.27 Imported 2026-05-06
309 Qwen3-14B + Baseline 82.27 Imported 2026-05-06
310 DS-r1 + SelfReminder 82.16 Imported 2026-05-06
311 Qwen2-7B + SelfReminder 82.14 Imported 2026-05-06
312 Qwen3-1.7B + GoalPriority 82.13 Imported 2026-05-06
313 Qwen3-14B + Semantic SmoothLLM 82.13 Imported 2026-05-06
314 DS-Llama-70b + SmoothLLM 82.11 Imported 2026-05-06
315 Doubao-pro + Paraphrase 82.06 Imported 2026-05-06
316 Gemini-2.0-flash + Paraphrase 82.05 Imported 2026-05-06
317 Qwen2.5-0.5B + Paraphrase 81.93 Imported 2026-05-06
318 Qwen2.5-1.5B + PerplexityFilter 81.86 Imported 2026-05-06
319 Qwen2.5-14B + Semantic SmoothLLM 81.74 Imported 2026-05-06
320 Llama-3.1-Tulu-3-70B + Semantic SmoothLLM 81.73 Imported 2026-05-06
321 Qwen3-8B + PerplexityFilter 81.71 Imported 2026-05-06
322 Llama-3.1-8B + RPO 81.70 Imported 2026-05-06
323 DS-2-1212 + GoalPriority 81.67 Imported 2026-05-06
324 Qwen3-8B + RPO 81.57 Imported 2026-05-06
325 Llama-3.3-70B + SelfReminder 81.56 Imported 2026-05-06
326 Gemma-2-2b-it + ICL 81.49 Imported 2026-05-06
327 Qwen3-32B + Semantic SmoothLLM 81.47 Imported 2026-05-06
328 Qwen2.5-32B + Semantic SmoothLLM 81.42 Imported 2026-05-06
329 Qwen2.5-32B + PerplexityFilter 81.38 Imported 2026-05-06
330 Gemma-2-2b-it + Semantic SmoothLLM 81.32 Imported 2026-05-06
331 Qwen2-7B + PerplexityFilter 81.29 Imported 2026-05-06
332 Qwen2.5-1.5B + Baseline 81.23 Imported 2026-05-06
333 DS-v3-0324 + SmoothLLM 81.23 Imported 2026-05-06
334 Gemini-2.0-pro + ICL 81.23 Imported 2026-05-06
335 Qwen3-8B + Baseline 81.20 Imported 2026-05-06
336 Qwen2-7B + Baseline 81.20 Imported 2026-05-06
337 DS-2-1212 + SelfReminder 81.19 Imported 2026-05-06
338 DS-Llama-70b + Baseline 81.16 Imported 2026-05-06
339 Qwen2-7B + RPO 81.10 Imported 2026-05-06
340 Qwen2.5-32B + Baseline 81.02 Imported 2026-05-06
341 Doubao-1.5-pro + Paraphrase 80.97 Imported 2026-05-06
342 GLM-4-plus + SmoothLLM 80.92 Imported 2026-05-06
343 Qwen3-4B + Semantic SmoothLLM 80.88 Imported 2026-05-06
344 Gemma-2-2b-it + PerplexityFilter 80.69 Imported 2026-05-06
345 Claude-3-7-sonnet + Semantic SmoothLLM 80.67 Imported 2026-05-06
346 DS-Llama-70b + Semantic SmoothLLM 80.66 Imported 2026-05-06
347 Qwen2-7B + Semantic SmoothLLM 80.64 Imported 2026-05-06
348 DS-v3-0324 + Baseline 80.60 Imported 2026-05-06
349 Llama-3.3-70B + SmoothLLM 80.60 Imported 2026-05-06
350 Qwen2-7B + ICL 80.60 Imported 2026-05-06
351 Phi-3-5-MoE + Paraphrase 80.58 Imported 2026-05-06
352 Qwen2.5-3B + SmoothLLM 80.57 Imported 2026-05-06
353 Doubao-1.5-lite + Paraphrase 80.50 Imported 2026-05-06
354 Kimi-latest + Semantic SmoothLLM 80.48 Imported 2026-05-06
355 Llama-3.3-70B + GoalPriority 80.41 Imported 2026-05-06
356 Llama-3.1-8B + SmoothLLM 80.33 Imported 2026-05-06
357 Gemma-2-2b-it + Paraphrase 80.32 Imported 2026-05-06
358 GPT-4o-08-06 + Paraphrase 80.32 Imported 2026-05-06
359 Qwen3-32B + PerplexityFilter 80.28 Imported 2026-05-06
360 Gemma-2-2b-it + Baseline 80.27 Imported 2026-05-06
361 Qwen2.5-3B + ICL 80.25 Imported 2026-05-06
362 Qwen2.5-7B + Semantic SmoothLLM 80.22 Imported 2026-05-06
363 Qwen2.5-3B + Semantic SmoothLLM 80.20 Imported 2026-05-06
364 Qwen2.5-7B + SelfReminder 80.20 Imported 2026-05-06
365 Qwen2-7B + SmoothLLM 80.20 Imported 2026-05-06
366 Qwen2.5-3B + RPO 80.18 Imported 2026-05-06
367 Qwen3-32B + Baseline 80.17 Imported 2026-05-06
368 Qwen2.5-7B + SmoothLLM 80.10 Imported 2026-05-06
369 Qwen2.5-7B + RPO 80.03 Imported 2026-05-06
370 Llama-3.1-8B + PerplexityFilter 79.99 Imported 2026-05-06
371 Gemini-2.0-flash + Baseline 79.94 Imported 2026-05-06
372 Qwen2-72B + Paraphrase 79.78 Imported 2026-05-06
373 Qwen2.5-0.5B + SelfReminder 79.77 Imported 2026-05-06
374 Llama-3.1-70B + PerplexityFilter 79.76 Imported 2026-05-06
375 Kimi-latest + Paraphrase 79.76 Imported 2026-05-06
376 Qwen3-1.7B + SelfReminder 79.75 Imported 2026-05-06
377 Doubao-lite + Semantic SmoothLLM 79.75 Imported 2026-05-06
378 Llama-3-1-405B + Baseline 79.72 Imported 2026-05-06
379 Qwen2.5-7B + GoalPriority 79.65 Imported 2026-05-06
380 Doubao-1.5-lite + Semantic SmoothLLM 79.64 Imported 2026-05-06
381 Gemini-2.0-pro + Baseline 79.58 Imported 2026-05-06
382 Gemini-2.0-flash-lite + Paraphrase 79.55 Imported 2026-05-06
383 Llama-3.3-70B + RPO 79.50 Imported 2026-05-06
384 Qwen2-7B + Paraphrase 79.48 Imported 2026-05-06
385 Qwen3-4B + SmoothLLM 79.41 Imported 2026-05-06
386 Qwen2.5-3B + Paraphrase 79.39 Imported 2026-05-06
387 GLM-4-flash + SelfReminder 79.27 Imported 2026-05-06
388 DS-r1 + ICL 79.26 Imported 2026-05-06
389 Doubao-1.5-pro + Semantic SmoothLLM 79.24 Imported 2026-05-06
390 Qwen2.5-72B + Paraphrase 79.18 Imported 2026-05-06
391 o3-mini + Semantic SmoothLLM 79.17 Imported 2026-05-06
392 Doubao-pro + Semantic SmoothLLM 79.07 Imported 2026-05-06
393 Qwen2.5-7B + ICL 79.03 Imported 2026-05-06
394 Llama-3.1-70B + Baseline 79.02 Imported 2026-05-06
395 Qwen2.5-7B + PerplexityFilter 78.99 Imported 2026-05-06
396 DS-v3 + ICL 78.97 Imported 2026-05-06
397 Gemini-2.0-flash + RPO 78.97 Imported 2026-05-06
398 Qwen2.5-14B + Paraphrase 78.95 Imported 2026-05-06
399 Gemini-2.0-flash + SmoothLLM 78.90 Imported 2026-05-06
400 Qwen2.5-3B + SelfReminder 78.76 Imported 2026-05-06
401 Qwen2.5-0.5B + PerplexityFilter 78.69 Imported 2026-05-06
402 DS-r1 + Paraphrase 78.59 Imported 2026-05-06
403 Llama-3.2-3B + Paraphrase 78.57 Imported 2026-05-06
404 Gemini-2.0-flash + ICL 78.51 Imported 2026-05-06
405 Qwen3-1.7B + ICL 78.50 Imported 2026-05-06
406 Qwen2.5-0.5B + RPO 78.36 Imported 2026-05-06
407 Qwen2.5-7B + Baseline 78.31 Imported 2026-05-06
408 Qwen2.5-7B + Paraphrase 78.31 Imported 2026-05-06
409 Qwen2.5-32B + Paraphrase 78.18 Imported 2026-05-06
410 Qwen3-1.7B + Semantic SmoothLLM 78.11 Imported 2026-05-06
411 DS-v3-0324 + Paraphrase 78.09 Imported 2026-05-06
412 Qwen3-4B + RPO 78.06 Imported 2026-05-06
413 Llama-3.1-8B + Semantic SmoothLLM 78.05 Imported 2026-05-06
414 Gemini-2.0-pro + RPO 78.00 Imported 2026-05-06
415 Gemini-2.0-flash + Semantic SmoothLLM 78.00 Imported 2026-05-06
416 Qwen2.5-3B + PerplexityFilter 77.94 Imported 2026-05-06
417 Qwen2.5-0.5B + SmoothLLM 77.93 Imported 2026-05-06
418 Qwen2.5-0.5B + Semantic SmoothLLM 77.90 Imported 2026-05-06
419 GPT-4o-mini + Paraphrase 77.83 Imported 2026-05-06
420 Llama-3.1-8B + Baseline 77.80 Imported 2026-05-06
421 Gemini-2.0-pro + Paraphrase 77.77 Imported 2026-05-06
422 Llama-3.2-1B + Paraphrase 77.68 Imported 2026-05-06
423 Qwen2.5-0.5B + GoalPriority 77.66 Imported 2026-05-06
424 Llama-3.1-8B + Paraphrase 77.65 Imported 2026-05-06
425 Gemini-2.0-pro + SmoothLLM 77.63 Imported 2026-05-06
426 Qwen2.5-3B + Baseline 77.45 Imported 2026-05-06
427 Qwen2.5-0.5B + Baseline 77.44 Imported 2026-05-06
428 DS-v3-0324 + Semantic SmoothLLM 77.24 Imported 2026-05-06
429 Qwen2.5-3B + GoalPriority 77.14 Imported 2026-05-06
430 Gemini-2.0-flash + PerplexityFilter 77.13 Imported 2026-05-06
431 DS-v3 + RPO 77.10 Imported 2026-05-06
432 Qwen3-4B + Baseline 76.83 Imported 2026-05-06
433 Qwen3-0.6B + Semantic SmoothLLM 76.81 Imported 2026-05-06
434 Qwen3-0.6B + SelfReminder 76.74 Imported 2026-05-06
435 Llama-3.2-3B + Semantic SmoothLLM 76.56 Imported 2026-05-06
436 Gemini-2.0-pro + Semantic SmoothLLM 76.54 Imported 2026-05-06
437 GLM-4-flash + Paraphrase 76.51 Imported 2026-05-06
438 Llama-3.2-1B + Semantic SmoothLLM 76.49 Imported 2026-05-06
439 GLM-4-flash + PerplexityFilter 76.43 Imported 2026-05-06
440 GLM-4-flash + ICL 76.33 Imported 2026-05-06
441 Gemini-2.0-flash-lite + Baseline 76.25 Imported 2026-05-06
442 Qwen3-0.6B + GoalPriority 76.15 Imported 2026-05-06
443 Gemini-2.0-flash-lite + RPO 76.14 Imported 2026-05-06
444 GLM-4-plus + Paraphrase 76.13 Imported 2026-05-06
445 Llama-3-1-405B + Semantic SmoothLLM 76.10 Imported 2026-05-06
446 Gemini-2.0-flash-lite + Semantic SmoothLLM 76.08 Imported 2026-05-06
447 Qwen3-1.7B + SmoothLLM 76.07 Imported 2026-05-06
448 DS-r1 + Semantic SmoothLLM 76.02 Imported 2026-05-06
449 Llama-3.3-70B + PerplexityFilter 75.97 Imported 2026-05-06
450 GLM-4-flash + RPO 75.95 Imported 2026-05-06
451 DS-v3 + Semantic SmoothLLM 75.84 Imported 2026-05-06
452 Qwen3-0.6B + ICL 75.78 Imported 2026-05-06
453 GLM-4-flash + Baseline 75.73 Imported 2026-05-06
454 Llama-3.3-70B + Baseline 75.64 Imported 2026-05-06
455 Qwen3-4B + PerplexityFilter 75.61 Imported 2026-05-06
456 GLM-4-flash + Semantic SmoothLLM 75.58 Imported 2026-05-06
457 Llama-3-1-405B + Paraphrase 75.56 Imported 2026-05-06
458 DS-v3 + Paraphrase 75.44 Imported 2026-05-06
459 GLM-4-plus + Semantic SmoothLLM 75.42 Imported 2026-05-06
460 Llama-3.3-70B + Paraphrase 75.40 Imported 2026-05-06
461 DS-v3 + SmoothLLM 75.39 Imported 2026-05-06
462 DS-v3 + PerplexityFilter 75.36 Imported 2026-05-06
463 DS-r1 + RPO 75.25 Imported 2026-05-06
464 GLM-4-flash + SmoothLLM 75.21 Imported 2026-05-06
465 DS-2-1212 + Paraphrase 75.19 Imported 2026-05-06
466 Gemini-2.0-flash-lite + SmoothLLM 75.11 Imported 2026-05-06
467 Llama-3.1-70B + Paraphrase 75.05 Imported 2026-05-06
468 Llama-3.3-70B + Semantic SmoothLLM 74.94 Imported 2026-05-06
469 Gemini-2.0-flash-lite + PerplexityFilter 74.75 Imported 2026-05-06
470 DS-r1 + SmoothLLM 74.56 Imported 2026-05-06
471 Gemini-2.0-flash-lite + ICL 74.55 Imported 2026-05-06
472 DS-v3 + Baseline 74.44 Imported 2026-05-06
473 Qwen3-0.6B + SmoothLLM 74.39 Imported 2026-05-06
474 Llama-3.1-70B + Semantic SmoothLLM 74.03 Imported 2026-05-06
475 DS-v3-0324 + PerplexityFilter 73.25 Imported 2026-05-06
476 DS-r1 + PerplexityFilter 73.08 Imported 2026-05-06
477 Qwen3-1.7B + RPO 72.73 Imported 2026-05-06
478 DS-2-1212 + Semantic SmoothLLM 72.68 Imported 2026-05-06
479 DS-r1 + Baseline 72.48 Imported 2026-05-06
480 Qwen3-0.6B + PerplexityFilter 72.00 Imported 2026-05-06
481 Qwen3-1.7B + PerplexityFilter 71.36 Imported 2026-05-06
482 Qwen3-1.7B + Baseline 70.91 Imported 2026-05-06
483 Qwen3-0.6B + RPO 70.65 Imported 2026-05-06
484 Qwen3-0.6B + Baseline 70.42 Imported 2026-05-06
485 DS-2-1212 + RPO 70.22 Imported 2026-05-06
486 DS-2-1212 + SmoothLLM 68.32 Imported 2026-05-06
487 DS-2-1212 + ICL 67.80 Imported 2026-05-06
488 DS-2-1212 + PerplexityFilter 67.72 Imported 2026-05-06
489 DS-2-1212 + Baseline 66.36 Imported 2026-05-06
490 Gemini-2.0-pro + PerplexityFilter 65.00 Imported 2026-05-06
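
The aggregation behind this table (group by model_name and defense_method, then take 100 minus the mean attack success rate) can be sketched roughly as follows. This is a minimal illustration, not the leaderboard's actual import code. The `asr` column name, the percent scale, the function name `aggregate_robustness`, and the sample rows are all assumptions made for the example; only `model_name`, `defense_method`, and the robustness formula come from the description above.

```python
from collections import defaultdict
from statistics import mean


def aggregate_robustness(rows):
    """Group rows by (model_name, defense_method) and score each group.

    Assumes each row carries an attack success rate in percent under the
    hypothetical key "asr"; the real CSV column names may differ.
    """
    groups = defaultdict(list)
    for row in rows:
        key = (row["model_name"], row["defense_method"])
        groups[key].append(float(row["asr"]))

    results = []
    for (model, defense), asrs in groups.items():
        mean_asr = mean(asrs)
        results.append({
            "subject": f"{model} + {defense}",
            "mean_asr": round(mean_asr, 2),
            # robustness_score is 100 minus mean attack success rate
            "robustness_score": round(100.0 - mean_asr, 2),
            "aggregated_rows": len(asrs),
        })

    # Higher robustness ranks first, as in the table.
    results.sort(key=lambda r: r["robustness_score"], reverse=True)
    return results


# Made-up rows for illustration only; these are not PandaBench data.
sample = [
    {"model_name": "model-a", "defense_method": "Baseline", "asr": "10.0"},
    {"model_name": "model-a", "defense_method": "Baseline", "asr": "20.0"},
    {"model_name": "model-a", "defense_method": "SelfDefense", "asr": "4.0"},
    {"model_name": "model-a", "defense_method": "SelfDefense", "asr": "6.0"},
]

ranked = aggregate_robustness(sample)
for rank, r in enumerate(ranked, start=1):
    print(rank, r["subject"], r["robustness_score"])
```

Read in the other direction, the same formula means the top entry's robustness score of 98.25 corresponds to a mean attack success rate of 1.75.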