AgentIF

Agent instruction-following benchmark measuring constraint and instruction success across vanilla, condition, example, formatting, semantic, and tool constraints.

Rows: 15
Primary metric: constraint_success_rate
Sampled: 2026-05-27

Metadata

Metrics

Constraint Success Rate, Instruction Success Rate, Vanilla, Condition, Example, Formatting, Semantic, Tool

Latest Results

Rows parsed from the public AgentIF project-page leaderboard. The primary score is constraint success rate, with instruction success and constraint-category breakdowns preserved.

Rank | Subject | Constraint Success Rate | Model Match | Provenance | Sampled
1 | o1-mini | 59.8 | - | Imported | 2026-05-27
2 | GPT-4o | 58.5 | GPT-4o (openai-gpt-4o) | Imported | 2026-05-27
3 | Qwen3-32B | 58.4 | Qwen3 32B (qwen-qwen3-32b) | Imported | 2026-05-27
4 | QwQ-32B | 58.1 | - | Imported | 2026-05-27
5 | DeepSeek-R1 | 57.9 | R1 (deepseek-r1) | Imported | 2026-05-27
6 | GLM-Z1-32B | 57.8 | - | Imported | 2026-05-27
7 | DeepSeek-V3 | 56.7 | DeepSeek V3 (deepseek-deepseek-chat) | Imported | 2026-05-27
8 | Claude-3-5-Sonnet | 56.6 | Claude 3.5 Sonnet (anthropic-claude-3.5-sonnet) | Imported | 2026-05-27
9 | Meta-Llama-3.1-70B-Instruct | 56.3 | - | Imported | 2026-05-27
10 | DeepSeek-R1-Distill-Qwen-32B | 55.1 | R1 Distill Qwen 32B (deepseek-deepseek-r1-distill-qwen-32b) | Imported | 2026-05-27
11 | DeepSeek-R1-Distill-Llama-70B | 55 | R1 Distill Llama 70B (deepseek-deepseek-r1-distill-llama-70b) | Imported | 2026-05-27
12 | Meta-Llama-3.1-8B-Instruct | 53.6 | - | Imported | 2026-05-27
13 | Crab-DPO-7B | 47.2 | - | Imported | 2026-05-27
14 | Mistral-7B-Instruct-v0.3 | 46.8 | - | Imported | 2026-05-27
15 | Conifer-DPO-7B | 44.3 | - | Imported | 2026-05-27
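
The ranking above follows from sorting rows by the primary metric, highest first. The following is a minimal sketch of that ordering, not the page's actual import code: the `Row` class and `rank_rows` function are hypothetical names introduced for illustration, and only four of the fifteen rows are included, with scores copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """One leaderboard entry (hypothetical structure, for illustration only)."""
    subject: str
    constraint_success_rate: float


def rank_rows(rows):
    """Return (rank, subject, score) tuples ordered by the primary metric, descending."""
    ordered = sorted(rows, key=lambda r: r.constraint_success_rate, reverse=True)
    return [
        (rank, r.subject, r.constraint_success_rate)
        for rank, r in enumerate(ordered, start=1)
    ]


# A subset of the rows listed above, deliberately given out of order.
rows = [
    Row("GPT-4o", 58.5),
    Row("Conifer-DPO-7B", 44.3),
    Row("o1-mini", 59.8),
    Row("Qwen3-32B", 58.4),
]

ranked = rank_rows(rows)

if __name__ == "__main__":
    for rank, subject, score in ranked:
        print(f"{rank} | {subject} | {score}")
```

Because only a subset is ranked here, the positions are relative to that subset: Conifer-DPO-7B comes out fourth rather than fifteenth.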