AgentIF
Agent instruction-following benchmark that reports constraint-level and instruction-level success rates, with per-type breakdowns for vanilla, condition, example, formatting, semantic, and tool constraints.
Rows: 15
Primary metric: constraint_success_rate
Sampled: 2026-05-27
Metadata
Metrics: Constraint Success Rate, Instruction Success Rate, Vanilla, Condition, Example, Formatting, Semantic, Tool
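This page does not define the two headline metrics, so the following is only a minimal sketch under an assumed reading: Constraint Success Rate is taken as the share of individual constraints satisfied, pooled over all instructions, and Instruction Success Rate as the share of instructions whose constraints are all satisfied. The function names and the `results` layout (one list of pass/fail flags per instruction) are hypothetical and are not taken from the benchmark's code.

```python
from typing import List


def constraint_success_rate(results: List[List[bool]]) -> float:
    """Percentage of individual constraints satisfied, pooled over all instructions.

    Assumed definition; the benchmark's own aggregation may differ.
    """
    total = sum(len(flags) for flags in results)
    if total == 0:
        return 0.0
    passed = sum(sum(flags) for flags in results)
    return 100.0 * passed / total


def instruction_success_rate(results: List[List[bool]]) -> float:
    """Percentage of instructions for which every constraint is satisfied.

    Note: an instruction with no constraints counts as a success here,
    because all([]) is True in Python.
    """
    if not results:
        return 0.0
    fully_passed = sum(1 for flags in results if all(flags))
    return 100.0 * fully_passed / len(results)


# Hypothetical pass/fail flags for three instructions.
results = [
    [True, True, False],
    [True, True],
    [False, True, True, True],
]

print(round(constraint_success_rate(results), 1))
print(round(instruction_success_rate(results), 1))
```

Under this reading the instruction-level rate can never exceed the constraint-level rate when every instruction carries the same number of constraints, since a single failed constraint fails the whole instruction.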
| Rank | Subject | Constraint Success Rate | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | o1-mini | 59.8 | — | Imported | 2026-05-27 |
| 2 | GPT-4o | 58.5 | GPT-4o (`openai-gpt-4o`) | Imported | 2026-05-27 |
| 3 | Qwen3-32B | 58.4 | Qwen3 32B (`qwen-qwen3-32b`) | Imported | 2026-05-27 |
| 4 | QwQ-32B | 58.1 | — | Imported | 2026-05-27 |
| 5 | DeepSeek-R1 | 57.9 | R1 (`deepseek-r1`) | Imported | 2026-05-27 |
| 6 | GLM-Z1-32B | 57.8 | — | Imported | 2026-05-27 |
| 7 | DeepSeek-V3 | 56.7 | DeepSeek V3 (`deepseek-deepseek-chat`) | Imported | 2026-05-27 |
| 8 | Claude-3-5-Sonnet | 56.6 | Claude 3.5 Sonnet (`anthropic-claude-3.5-sonnet`) | Imported | 2026-05-27 |
| 9 | Meta-Llama-3.1-70B-Instruct | 56.3 | — | Imported | 2026-05-27 |
| 10 | DeepSeek-R1-Distill-Qwen-32B | 55.1 | R1 Distill Qwen 32B (`deepseek-deepseek-r1-distill-qwen-32b`) | Imported | 2026-05-27 |
| 11 | DeepSeek-R1-Distill-Llama-70B | 55.0 | R1 Distill Llama 70B (`deepseek-deepseek-r1-distill-llama-70b`) | Imported | 2026-05-27 |
| 12 | Meta-Llama-3.1-8B-Instruct | 53.6 | — | Imported | 2026-05-27 |
| 13 | Crab-DPO-7B | 47.2 | — | Imported | 2026-05-27 |
| 14 | Mistral-7B-Instruct-v0.3 | 46.8 | — | Imported | 2026-05-27 |
| 15 | Conifer-DPO-7B | 44.3 | — | Imported | 2026-05-27 |