TERMS-Bench
Simulator-based bilateral price negotiation benchmark for LLM agents, measuring surplus extraction, feasible agreement calibration, belief error, and procedural compliance without an LLM judge.
15 rows
se_plus primary metric
2026-05-28 sampled
Metadata
Metrics
SE+ (Feasible Surplus Efficiency), AGR+ (Feasible Agreement Rate), CSE+ (Conditional Feasible Surplus Efficiency), FAGR- (No-Deal False Agreement Rate) (lower is better), Safe Termination Rate, BE_type (Belief Error) (lower is better), Stance Accuracy, Critical Violation Rate (lower is better), Mean Utility, Conditional Utility
| Rank | Subject | SE+ (Feasible Surplus Efficiency) | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 | 69.4% SE+ | Claude Opus 4.6 (`anthropic-claude-opus-4.6`) | Imported | 2026-05-28 |
| 2 | GLM 5.1 | 68.6% SE+ | GLM 5.1 (`z-ai-glm-5.1`) | Imported | 2026-05-28 |
| 3 | Claude Opus 4.7 | 66.0% SE+ | Claude Opus 4.7 (`anthropic-claude-opus-4.7`) | Imported | 2026-05-28 |
| 4 | Gemma 4 31B | 64.0% SE+ | Gemma 4 31B (`google-gemma-4-31b-it`) | Imported | 2026-05-28 |
| 5 | Gemini 3.1 Pro | 63.9% SE+ | Gemini 3.1 Pro Preview (`google-gemini-3.1-pro-preview`) | Imported | 2026-05-28 |
| 6 | DeepSeek V4 Pro | 61.8% SE+ | DeepSeek V4 Pro (`deepseek-deepseek-v4-pro`) | Imported | 2026-05-28 |
| 7 | GPT-5.5 | 60.6% SE+ | GPT-5.5 (`openai-gpt-5.5`) | Imported | 2026-05-28 |
| 8 | Qwen 3.6 Plus | 60.4% SE+ | Qwen3.6 Plus (`qwen-qwen3.6-plus`) | Imported | 2026-05-28 |
| 9 | Grok 4.20 | 60.1% SE+ | Grok 4.20 (`x-ai-grok-4.20`) | Imported | 2026-05-28 |
| 10 | Kimi K2.6 | 59.7% SE+ | MoonshotAI: Kimi K2.6 (`moonshotai-kimi-k2.6`) | Imported | 2026-05-28 |
| 11 | Doubao Seed 2.0 Pro | 52.2% SE+ | — | Imported | 2026-05-28 |
| 12 | Fixed 0.30 | 38.7% SE+ | — | Imported | 2026-05-28 |
| 13 | Fixed 0.10 | 29.0% SE+ | — | Imported | 2026-05-28 |
| 14 | Fixed 0.01 | 27.3% SE+ | — | Imported | 2026-05-28 |
| 15 | GPT-4o mini | 18.9% SE+ | GPT-4o-mini openai-gpt-4o-mini | Imported | 2026-05-28 |
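One way to read the table is to compare each row against the strongest of the three `Fixed` rows. The sketch below is illustrative only: it copies the SE+ percentages from the table and reports each row's gap in percentage points. Treating the `Fixed` rows as a reference point is an assumption made here, not something the leaderboard states, and the helper names `best_fixed` and `margin_over` are hypothetical.

```python
# SE+ values (percent) copied from the leaderboard table above.
SE_PLUS = {
    "Claude Opus 4.6": 69.4,
    "GLM 5.1": 68.6,
    "Claude Opus 4.7": 66.0,
    "Gemma 4 31B": 64.0,
    "Gemini 3.1 Pro": 63.9,
    "DeepSeek V4 Pro": 61.8,
    "GPT-5.5": 60.6,
    "Qwen 3.6 Plus": 60.4,
    "Grok 4.20": 60.1,
    "Kimi K2.6": 59.7,
    "Doubao Seed 2.0 Pro": 52.2,
    "Fixed 0.30": 38.7,
    "Fixed 0.10": 29.0,
    "Fixed 0.01": 27.3,
    "GPT-4o mini": 18.9,
}


def best_fixed(scores):
    """Return (name, score) for the highest-scoring row whose name starts with 'Fixed'."""
    fixed = {name: value for name, value in scores.items() if name.startswith("Fixed")}
    top = max(fixed, key=fixed.get)
    return top, fixed[top]


def margin_over(scores, reference):
    """Gap of every row to a reference score, in percentage points (rounded to 0.1)."""
    return {name: round(value - reference, 1) for name, value in scores.items()}


ref_name, ref_score = best_fixed(SE_PLUS)
margins = margin_over(SE_PLUS, ref_score)

# Non-Fixed rows that score below the reference row (assumption: this is a useful flag).
below_reference = [
    name for name, gap in margins.items()
    if gap < 0 and not name.startswith("Fixed")
]

if __name__ == "__main__":
    print(f"reference row: {ref_name} ({ref_score}% SE+)")
    for name, gap in margins.items():
        print(f"{name:22s} {gap:+6.1f} pp")
    print("below reference:", below_reference)
```

With the values in the table, the reference row is Fixed 0.30 at 38.7% SE+, and GPT-4o mini is the only non-Fixed row that falls below it.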