CAR-bench | BenchmarkList

Metadata

ID: car_bench
Category: Agentic
Release: 2026-01-29

Metrics

Avg Pass^3, Base Pass^1, Base Pass^3, Base Pass@3, Hallucination Pass^1, Hallucination Pass^3, Hallucination Pass@3, Disambiguation Pass^1, Disambiguation Pass^3, Disambiguation Pass@3
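The page lists these metrics without defining them. The sketch below assumes the convention common in agent benchmarks: Pass^k is the chance that all k sampled trials of a task succeed (consistency), and Pass@k is the chance that at least one of k succeeds (potential). Whether CAR-bench computes them exactly this way is an assumption. The function names and the sample outcomes are hypothetical and are not taken from the benchmark.

```python
from math import comb
from typing import Dict, List


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate Pass^k for one task: the chance that k trials drawn
    without replacement from n recorded trials (c of them successful)
    are ALL successful."""
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    return comb(c, k) / comb(n, k)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k for one task: the chance that at least one of
    k trials drawn from n recorded trials is successful."""
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def aggregate(outcomes: Dict[str, List[bool]], k: int) -> Dict[str, float]:
    """Average both per-task estimates over all tasks."""
    hat = [pass_hat_k(len(t), sum(t), k) for t in outcomes.values()]
    at = [pass_at_k(len(t), sum(t), k) for t in outcomes.values()]
    return {"pass^k": sum(hat) / len(hat), "pass@k": sum(at) / len(at)}


# Hypothetical outcomes: three tasks, three trials each.
sample = {
    "task_a": [True, True, True],     # always solved
    "task_b": [True, False, True],    # solved inconsistently
    "task_c": [False, False, False],  # never solved
}

scores_k3 = aggregate(sample, k=3)
scores_k1 = aggregate(sample, k=1)
```

Under these assumptions, only task_a counts toward Pass^3 while task_a and task_b both count toward Pass@3, so a large gap between the two columns signals a model that can solve a task but does not do so reliably. With k=1 the two metrics coincide, because "all of one trial" and "at least one of one trial" describe the same event.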

Rank	Subject	Avg Pass^3	Model Match	Provenance	Sampled
1	Claude-Opus-4.6 (auto-thinking)	0.58	Claude Opus 4.6 (anthropic-claude-opus-4.6)	Imported	2026-05-06
2	GPT-5 (thinking)	0.54	GPT-5 (openai-gpt-5)	Imported	2026-05-06
3	GPT-5.2 (thinking)	0.53	GPT-5.2 (openai-gpt-5.2)	Imported	2026-05-06
4	Claude-Opus-4.5 (thinking)	0.52	Claude Opus 4.5 (anthropic-claude-opus-4.5)	Imported	2026-05-06
5	Claude-Sonnet-4 (thinking)	0.47	Claude Sonnet 4 (anthropic-claude-sonnet-4)	Imported	2026-05-06
6	Gemini-2.5-flash (thinking)	0.41	Gemini 2.5 Flash (google-gemini-2.5-flash)	Imported	2026-05-06
7	Gemini-2.5-pro (auto-thinking)	0.38	Gemini 2.5 Pro (google-gemini-2.5-pro)	Imported	2026-05-06
8	GPT-4.1	0.37	GPT-4.1 (openai-gpt-4.1)	Imported	2026-05-06
9	Gemini-2.5-flash	0.34	Gemini 2.5 Flash (google-gemini-2.5-flash)	Imported	2026-05-06
10	Qwen3-32b (thinking)	0.31	Qwen3 32B (qwen-qwen3-32b)	Imported	2026-05-06
11	GPT-Oss-120b (thinking)	0.28	gpt-oss-120b (openai-gpt-oss-120b)	Imported	2026-05-06
12	xLAM-2-32b	0.16	—	Imported	2026-05-06