CAR-bench
Automotive in-car assistant benchmark evaluating multi-turn tool-using LLM agents under uncertainty, ambiguity, missing capabilities, and domain policy constraints.
Rows: 12
Primary metric: avg_pass_3
Sampled: 2026-05-06
Metadata
Metrics
Avg Pass^3, Base Pass^1, Base Pass^3, Base Pass@3, Hallucination Pass^1, Hallucination Pass^3, Hallucination Pass@3, Disambiguation Pass^1, Disambiguation Pass^3, Disambiguation Pass@3
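The metric names above can be read with a small sketch. It assumes the common convention that Pass^k counts a task as solved only if all k independent trials succeed, while Pass@k counts it as solved if at least one of k trials succeeds. It also assumes that Avg Pass^3 is the unweighted mean of Pass^3 over the Base, Hallucination, and Disambiguation task types; the page does not state the aggregation, so treat that as a guess. The `outcomes` data and the function names are hypothetical and are not taken from the benchmark.

```python
# Hypothetical per-task trial outcomes: one inner list per task, one bool per trial.
# These values are made up for illustration; they are NOT CAR-bench results.
outcomes = {
    "base": [[True, True, True], [True, False, True]],
    "hallucination": [[False, False, False], [True, True, True]],
    "disambiguation": [[True, True, False], [True, True, True]],
}


def pass_hat_k(trials, k):
    """Pass^k (assumed definition): share of tasks whose first k trials ALL succeed."""
    return sum(all(t[:k]) for t in trials) / len(trials)


def pass_at_k(trials, k):
    """Pass@k (assumed definition): share of tasks with AT LEAST ONE success in k trials."""
    return sum(any(t[:k]) for t in trials) / len(trials)


def avg_pass_hat_k(by_type, k):
    """Assumed aggregation: unweighted mean of Pass^k across task types."""
    scores = [pass_hat_k(trials, k) for trials in by_type.values()]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    for task_type, trials in outcomes.items():
        print(
            task_type,
            "Pass^1 =", pass_hat_k(trials, 1),
            "Pass^3 =", pass_hat_k(trials, 3),
            "Pass@3 =", pass_at_k(trials, 3),
        )
    print("Avg Pass^3 =", avg_pass_hat_k(outcomes, 3))
```

Under these definitions Pass^3 can never exceed Pass@3 for the same trials, which is why a consistency-oriented metric such as Avg Pass^3 tends to be the stricter number to rank on.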
| Rank | Subject | Avg Pass^3 | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude-Opus-4.6 (auto-thinking) | 0.58 | Claude Opus 4.6 (`anthropic-claude-opus-4.6`) | Imported | 2026-05-06 |
| 2 | GPT-5 (thinking) | 0.54 | GPT-5 (`openai-gpt-5`) | Imported | 2026-05-06 |
| 3 | GPT-5.2 (thinking) | 0.53 | GPT-5.2 (`openai-gpt-5.2`) | Imported | 2026-05-06 |
| 4 | Claude-Opus-4.5 (thinking) | 0.52 | Claude Opus 4.5 (`anthropic-claude-opus-4.5`) | Imported | 2026-05-06 |
| 5 | Claude-Sonnet-4 (thinking) | 0.47 | Claude Sonnet 4 (`anthropic-claude-sonnet-4`) | Imported | 2026-05-06 |
| 6 | Gemini-2.5-flash (thinking) | 0.41 | Gemini 2.5 Flash (`google-gemini-2.5-flash`) | Imported | 2026-05-06 |
| 7 | Gemini-2.5-pro (auto-thinking) | 0.38 | Gemini 2.5 Pro (`google-gemini-2.5-pro`) | Imported | 2026-05-06 |
| 8 | GPT-4.1 | 0.37 | GPT-4.1 (`openai-gpt-4.1`) | Imported | 2026-05-06 |
| 9 | Gemini-2.5-flash | 0.34 | Gemini 2.5 Flash (`google-gemini-2.5-flash`) | Imported | 2026-05-06 |
| 10 | Qwen3-32b (thinking) | 0.31 | Qwen3 32B (`qwen-qwen3-32b`) | Imported | 2026-05-06 |
| 11 | GPT-Oss-120b (thinking) | 0.28 | gpt-oss-120b (`openai-gpt-oss-120b`) | Imported | 2026-05-06 |
| 12 | xLAM-2-32b | 0.16 | — | Imported | 2026-05-06 |