CRMArena

CRMArena evaluates LLM agents on realistic customer relationship management (CRM) tasks in a simulated Salesforce CRM organization, covering three personas: service agent, analyst, and manager.

Rows: 24
Primary metric: Overall
Sampled: 2026-05-06

Metadata

Metrics

Overall, New Case Routing, Handle Time Understanding, Transfer Count Understanding, Name Entity Disambiguation, Policy Violation Identification, Knowledge Question Answering, Top Issue Identification, Monthly Trend Analysis, Best Region Identification

Latest Results

| Rank | Subject | Overall | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | o1 (Function Calling) | 64.30 | | Imported | 2026-05-06 |
| 2 | o1 (ReAct) | 57.70 | | Imported | 2026-05-06 |
| 3 | gpt-4o (Function Calling) | 54.40 | | Imported | 2026-05-06 |
| 4 | llama3.1-405b (Function Calling) | 51.30 | | Imported | 2026-05-06 |
| 5 | claude-3.5-sonnet (Function Calling) | 41.80 | | Imported | 2026-05-06 |
| 6 | llama3.1-70b (Function Calling) | 41.10 | | Imported | 2026-05-06 |
| 7 | gpt-4o (ReAct) | 38.20 | | Imported | 2026-05-06 |
| 8 | claude-3.5-sonnet (Act) | 37.40 | | Imported | 2026-05-06 |
| 9 | deepseek-r1 (ReAct) | 35.10 | | Imported | 2026-05-06 |
| 10 | claude-3.5-sonnet (ReAct) | 34.30 | | Imported | 2026-05-06 |
| 11 | llama3.1-405b (ReAct) | 33.80 | | Imported | 2026-05-06 |
| 12 | gpt-4o (Act) | 29.40 | | Imported | 2026-05-06 |
| 13 | gpt-4o-mini (ReAct) | 28.30 | | Imported | 2026-05-06 |
| 14 | llama3.1-70b (ReAct) | 27.80 | | Imported | 2026-05-06 |
| 15 | llama3.1-405b (Act) | 22.20 | | Imported | 2026-05-06 |
| 16 | gpt-4o-mini (Function Calling) | 19.50 | | Imported | 2026-05-06 |
| 17 | llama3.1-70b (Act) | 18.60 | | Imported | 2026-05-06 |
| 18 | claude-3-sonnet (ReAct) | 17.30 | | Imported | 2026-05-06 |
| 19 | gpt-4o-mini (Act) | 16.70 | | Imported | 2026-05-06 |
| 20 | claude-3-sonnet (Act) | 16.60 | | Imported | 2026-05-06 |
| 21 | claude-3-sonnet (Function Calling) | 15.10 | | Imported | 2026-05-06 |
| 22 | deepseek-r1 (Function Calling) | 9.00 | | Imported | 2026-05-06 |
| 23 | llama3.1-8b (ReAct) | 3.10 | | Imported | 2026-05-06 |
| 24 | llama3.1-8b (Function Calling) | 0.00 | | Imported | 2026-05-06 |
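
Each Subject in the results above appears to combine a model name with a parenthesized agent setup (Function Calling, ReAct, or Act). The following is a minimal sketch of how such labels could be split for per-model comparison. The function names `split_subject` and `best_per_model` are hypothetical, the reading of the parenthesized part as an agent setup is an assumption, and only a handful of rows are copied in for illustration.

```python
import re

# A few (subject, overall) pairs copied from the results above.
ROWS = [
    ("o1 (Function Calling)", 64.30),
    ("o1 (ReAct)", 57.70),
    ("gpt-4o (Function Calling)", 54.40),
    ("gpt-4o (ReAct)", 38.20),
    ("gpt-4o (Act)", 29.40),
]

# Assumed label shape: "<model> (<setup>)".
SUBJECT_RE = re.compile(r"^(?P<model>.+?) \((?P<setup>[^()]+)\)$")


def split_subject(subject):
    """Split a Subject label into (model, setup). Hypothetical helper."""
    match = SUBJECT_RE.match(subject)
    if match is None:
        raise ValueError(f"unexpected subject label: {subject!r}")
    return match.group("model"), match.group("setup")


def best_per_model(rows):
    """Return {model: (setup, score)} keeping the highest score per model."""
    best = {}
    for subject, score in rows:
        model, setup = split_subject(subject)
        if model not in best or score > best[model][1]:
            best[model] = (setup, score)
    return best


if __name__ == "__main__":
    for model, (setup, score) in best_per_model(ROWS).items():
        print(f"{model}: best with {setup} at {score:.2f}")
```

Grouping by model rather than by full Subject label makes it easier to see how much the agent setup alone changes a model's Overall score in the sampled rows.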