TERMS-Bench | BenchmarkList

Metadata

ID: terms_bench
Category: Agentic
Release: 2026-05-13

Metrics

SE+ (Feasible Surplus Efficiency)
AGR+ (Feasible Agreement Rate)
CSE+ (Conditional Feasible Surplus Efficiency)
FAGR- (No-Deal False Agreement Rate; lower is better)
Safe Termination Rate
BE_type (Belief Error; lower is better)
Stance Accuracy
Critical Violation Rate (lower is better)
Mean Utility
Conditional Utility

Rank	Subject	SE+ (Feasible Surplus Efficiency)	Model Match	Provenance	Sampled
1	Claude Opus 4.6	69.4%	Claude Opus 4.6 (anthropic-claude-opus-4.6)	Imported	2026-05-28
2	GLM 5.1	68.6%	GLM 5.1 (z-ai-glm-5.1)	Imported	2026-05-28
3	Claude Opus 4.7	66.0%	Claude Opus 4.7 (anthropic-claude-opus-4.7)	Imported	2026-05-28
4	Gemma 4 31B	64.0%	Gemma 4 31B (google-gemma-4-31b-it)	Imported	2026-05-28
5	Gemini 3.1 Pro	63.9%	Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview)	Imported	2026-05-28
6	DeepSeek V4 Pro	61.8%	DeepSeek V4 Pro (deepseek-deepseek-v4-pro)	Imported	2026-05-28
7	GPT-5.5	60.6%	GPT-5.5 (openai-gpt-5.5)	Imported	2026-05-28
8	Qwen 3.6 Plus	60.4%	Qwen3.6 Plus (qwen-qwen3.6-plus)	Imported	2026-05-28
9	Grok 4.20	60.1%	Grok 4.20 (x-ai-grok-4.20)	Imported	2026-05-28
10	Kimi K2.6	59.7%	MoonshotAI: Kimi K2.6 (moonshotai-kimi-k2.6)	Imported	2026-05-28
11	Doubao Seed 2.0 Pro	52.2%	—	Imported	2026-05-28
12	Fixed 0.30	38.7%	—	Imported	2026-05-28
13	Fixed 0.10	29.0%	—	Imported	2026-05-28
14	Fixed 0.01	27.3%	—	Imported	2026-05-28
15	GPT-4o mini	18.9%	GPT-4o-mini (openai-gpt-4o-mini)	Imported	2026-05-28
