CAR-bench

Automotive in-car assistant benchmark evaluating multi-turn tool-using LLM agents under uncertainty, ambiguity, missing capabilities, and domain policy constraints.

Rows: 12
Primary metric: avg_pass_3
Sampled: 2026-05-06

Metadata

Metrics

Avg Pass^3, Base Pass^1, Base Pass^3, Base Pass@3, Hallucination Pass^1, Hallucination Pass^3, Hallucination Pass@3, Disambiguation Pass^1, Disambiguation Pass^3, Disambiguation Pass@3
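
The page lists Pass^k and Pass@k metrics without defining them. As a rough sketch, assuming CAR-bench follows the common convention (Pass^k: all k sampled trials of a task succeed; Pass@k: at least one of k succeeds), the per-category scores could be computed as below. The estimators are the standard combinatorial ones; the function names and the trial data are hypothetical and not taken from CAR-bench.

```python
from math import comb
from statistics import mean
from typing import Dict, List


def pass_hat_k(successes: int, n: int, k: int) -> float:
    """Pass^k: chance that k trials drawn from the n recorded ones all succeed."""
    return comb(successes, k) / comb(n, k)


def pass_at_k(successes: int, n: int, k: int) -> float:
    """Pass@k: chance that at least one of k drawn trials succeeds."""
    return 1.0 - comb(n - successes, k) / comb(n, k)


def aggregate(trials: Dict[str, List[bool]], k: int) -> Dict[str, float]:
    """Average both metrics over tasks; each task maps to its trial outcomes."""
    counts = [(sum(outcomes), len(outcomes)) for outcomes in trials.values()]
    return {
        "pass^k": mean(pass_hat_k(c, n, k) for c, n in counts),
        "pass@k": mean(pass_at_k(c, n, k) for c, n in counts),
    }


# Hypothetical outcomes: four tasks, three trials each (not real CAR-bench data).
example_trials = {
    "task_a": [True, True, True],
    "task_b": [True, False, True],
    "task_c": [False, False, False],
    "task_d": [True, True, True],
}

if __name__ == "__main__":
    print(aggregate(example_trials, k=3))
```

With these made-up outcomes, Pass^3 is 0.5 (only task_a and task_d succeed on every trial) and Pass@3 is 0.75 (task_c never succeeds). This illustrates why Pass^3 is the stricter consistency measure of the two.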

Latest Results

Rows are parsed from the CAR-bench README baseline results table. Source model names and proprietary/open group labels are preserved.
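
The import step described above might look roughly like the following sketch. It assumes the README table is a pipe-delimited Markdown table; the column names (`Group`, `Model`, `Avg Pass^3`) and the `parse_pipe_table` helper are hypothetical, and the real README layout may differ. The two sample rows reuse Avg Pass^3 values from the results listed on this page.

```python
from typing import Dict, List


def parse_pipe_table(text: str) -> List[Dict[str, str]]:
    """Parse a pipe-delimited table into dicts keyed by the header cells."""

    def cells(line: str) -> List[str]:
        # Drop the outer pipes, then split on the inner ones.
        return [c.strip() for c in line.strip().strip("|").split("|")]

    lines = [ln for ln in text.splitlines() if ln.strip().startswith("|")]
    header = cells(lines[0])
    rows = []
    for line in lines[1:]:
        values = cells(line)
        # Skip the |---|---| separator row that follows the header.
        if all(set(v) <= set("-: ") for v in values):
            continue
        rows.append(dict(zip(header, values)))
    return rows


# Hypothetical README excerpt; the actual column layout is an assumption.
sample = """
| Group | Model | Avg Pass^3 |
|---|---|---|
| Open | Qwen3-32b (thinking) | 0.31 |
| Proprietary | GPT-5 (thinking) | 0.54 |
"""

# Source names and group labels are kept verbatim; only the order changes.
rows = sorted(parse_pipe_table(sample),
              key=lambda r: float(r["Avg Pass^3"]), reverse=True)

if __name__ == "__main__":
    for rank, row in enumerate(rows, start=1):
        print(rank, row["Model"], row["Avg Pass^3"], row["Group"])
```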

Rank  Subject                          Avg Pass^3  Model Match                                  Provenance  Sampled
1     Claude-Opus-4.6 (auto-thinking)  0.58        Claude Opus 4.6 (anthropic-claude-opus-4.6)  Imported    2026-05-06
2     GPT-5 (thinking)                 0.54        GPT-5 (openai-gpt-5)                         Imported    2026-05-06
3     GPT-5.2 (thinking)               0.53        GPT-5.2 (openai-gpt-5.2)                     Imported    2026-05-06
4     Claude-Opus-4.5 (thinking)       0.52        Claude Opus 4.5 (anthropic-claude-opus-4.5)  Imported    2026-05-06
5     Claude-Sonnet-4 (thinking)       0.47        Claude Sonnet 4 (anthropic-claude-sonnet-4)  Imported    2026-05-06
6     Gemini-2.5-flash (thinking)      0.41        Gemini 2.5 Flash (google-gemini-2.5-flash)   Imported    2026-05-06
7     Gemini-2.5-pro (auto-thinking)   0.38        Gemini 2.5 Pro (google-gemini-2.5-pro)       Imported    2026-05-06
8     GPT-4.1                          0.37        GPT-4.1 (openai-gpt-4.1)                     Imported    2026-05-06
9     Gemini-2.5-flash                 0.34        Gemini 2.5 Flash (google-gemini-2.5-flash)   Imported    2026-05-06
10    Qwen3-32b (thinking)             0.31        Qwen3 32B (qwen-qwen3-32b)                   Imported    2026-05-06
11    GPT-Oss-120b (thinking)          0.28        gpt-oss-120b (openai-gpt-oss-120b)           Imported    2026-05-06
12    xLAM-2-32b                       0.16        -                                            Imported    2026-05-06