AgentIF | BenchmarkList

Metadata

Constraint Success Rate, Instruction Success Rate, Vanilla, Condition, Example, Formatting, Semantic, Tool

Rank	Subject	Constraint Success Rate	Model Match	Provenance	Sampled
1	o1-mini	59.8	—	Imported	2026-05-27
2	GPT-4o	58.5	GPT-4o (openai-gpt-4o)	Imported	2026-05-27
3	Qwen3-32B	58.4	Qwen3 32B (qwen-qwen3-32b)	Imported	2026-05-27
4	QwQ-32B	58.1	—	Imported	2026-05-27
5	DeepSeek-R1	57.9	R1 (deepseek-r1)	Imported	2026-05-27
6	GLM-Z1-32B	57.8	—	Imported	2026-05-27
7	DeepSeek-V3	56.7	DeepSeek V3 (deepseek-deepseek-chat)	Imported	2026-05-27
8	Claude-3-5-Sonnet	56.6	Claude 3.5 Sonnet (anthropic-claude-3.5-sonnet)	Imported	2026-05-27
9	Meta-Llama-3.1-70B-Instruct	56.3	—	Imported	2026-05-27
10	DeepSeek-R1-Distill-Qwen-32B	55.1	R1 Distill Qwen 32B (deepseek-deepseek-r1-distill-qwen-32b)	Imported	2026-05-27
11	DeepSeek-R1-Distill-Llama-70B	55.0	R1 Distill Llama 70B (deepseek-deepseek-r1-distill-llama-70b)	Imported	2026-05-27
12	Meta-Llama-3.1-8B-Instruct	53.6	—	Imported	2026-05-27
13	Crab-DPO-7B	47.2	—	Imported	2026-05-27
14	Mistral-7B-Instruct-v0.3	46.8	—	Imported	2026-05-27
15	Conifer-DPO-7B	44.3	—	Imported	2026-05-27