TERMS-Bench

Simulator-based bilateral price negotiation benchmark for LLM agents, measuring surplus extraction, feasible agreement calibration, belief error, and procedural compliance without an LLM judge.

Rows: 15
Primary metric: se_plus
Sampled: 2026-05-28

Metadata

Metrics

SE+ (Feasible Surplus Efficiency)
AGR+ (Feasible Agreement Rate)
CSE+ (Conditional Feasible Surplus Efficiency)
FAGR- (No-Deal False Agreement Rate; lower is better)
Safe Termination Rate
BE_type (Belief Error; lower is better)
Stance Accuracy
Critical Violation Rate (lower is better)
Mean Utility
Conditional Utility

Latest Results

Rows are imported from the TERMS-Bench static project-site data.js payload and ranked by the Overall SE+ metric.

Rank  Subject              SE+ (Feasible Surplus Efficiency)  Model Match                                             Provenance  Sampled
1     Claude Opus 4.6      69.4%                              Claude Opus 4.6 (anthropic-claude-opus-4.6)             Imported    2026-05-28
2     GLM 5.1              68.6%                              GLM 5.1 (z-ai-glm-5.1)                                  Imported    2026-05-28
3     Claude Opus 4.7      66.0%                              Claude Opus 4.7 (anthropic-claude-opus-4.7)             Imported    2026-05-28
4     Gemma 4 31B          64.0%                              Gemma 4 31B (google-gemma-4-31b-it)                     Imported    2026-05-28
5     Gemini 3.1 Pro       63.9%                              Gemini 3.1 Pro Preview (google-gemini-3.1-pro-preview)  Imported    2026-05-28
6     DeepSeek V4 Pro      61.8%                              DeepSeek V4 Pro (deepseek-deepseek-v4-pro)              Imported    2026-05-28
7     GPT-5.5              60.6%                              GPT-5.5 (openai-gpt-5.5)                                Imported    2026-05-28
8     Qwen 3.6 Plus        60.4%                              Qwen3.6 Plus (qwen-qwen3.6-plus)                        Imported    2026-05-28
9     Grok 4.20            60.1%                              Grok 4.20 (x-ai-grok-4.20)                              Imported    2026-05-28
10    Kimi K2.6            59.7%                              MoonshotAI: Kimi K2.6 (moonshotai-kimi-k2.6)            Imported    2026-05-28
11    Doubao Seed 2.0 Pro  52.2%                              -                                                       Imported    2026-05-28
12    Fixed 0.30           38.7%                              -                                                       Imported    2026-05-28
13    Fixed 0.10           29.0%                              -                                                       Imported    2026-05-28
14    Fixed 0.01           27.3%                              -                                                       Imported    2026-05-28
15    GPT-4o mini          18.9%                              GPT-4o-mini (openai-gpt-4o-mini)                        Imported    2026-05-28
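
The ranking described above (rows ordered by the SE+ primary metric, highest first) can be sketched in a few lines of Python. This is only an illustrative sketch, not the site's actual import code: the row shape, the `subject` key, and the `rank_rows` helper are assumptions made for the example, and only the `se_plus` key name and the four sample values come from this page.

```python
# Hypothetical row shape: only the `se_plus` metric key appears on this page;
# the `subject` key and this list layout are assumptions for illustration.
rows = [
    {"subject": "GPT-4o mini", "se_plus": 18.9},
    {"subject": "GLM 5.1", "se_plus": 68.6},
    {"subject": "Fixed 0.30", "se_plus": 38.7},
    {"subject": "Claude Opus 4.6", "se_plus": 69.4},
]


def rank_rows(rows, metric="se_plus"):
    """Return copies of the rows sorted by `metric` (descending) with a 1-based rank."""
    ordered = sorted(rows, key=lambda row: row[metric], reverse=True)
    return [{"rank": i, **row} for i, row in enumerate(ordered, start=1)]


ranked = rank_rows(rows)
for row in ranked:
    print(f'{row["rank"]:>2}  {row["subject"]:<16} {row["se_plus"]:.1f}% SE+')
```

Because `sorted` is stable, rows with equal SE+ would keep their input order; how the real leaderboard breaks ties is not stated on this page.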