VitaBench

Interactive real-world applications benchmark for LLM agents across delivery, in-store consumption, online travel, and cross-scenario tasks with 66 tools and multi-turn user interactions.

Rows: 31
Primary metric: cross_scenarios_avg_at_4
Sampled: 2026-05-28

Metadata

Metrics

Cross-Scenarios Avg@4, Cross-Scenarios Pass@4, Cross-Scenarios Pass^4, Delivery Avg@4, Delivery Pass@4, Delivery Pass^4, In-store Avg@4, In-store Pass@4, In-store Pass^4, OTA Avg@4, OTA Pass@4, OTA Pass^4
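
The page does not define the three metric families. The sketch below assumes the usual reading for agent benchmarks: each task is attempted 4 times, Avg@4 is the mean success rate over those trials, Pass@4 counts a task as solved if any trial succeeds, and Pass^4 counts it only if all trials succeed. The function names and the sample outcomes are hypothetical and are not taken from VitaBench.

```python
from typing import Sequence

Trials = Sequence[Sequence[bool]]  # one inner sequence of k outcomes per task


def avg_at_k(trials: Trials) -> float:
    """Mean per-task success rate, averaged over all tasks."""
    return sum(sum(t) / len(t) for t in trials) / len(trials)


def pass_at_k(trials: Trials) -> float:
    """Fraction of tasks solved in at least one of the k trials."""
    return sum(any(t) for t in trials) / len(trials)


def pass_hat_k(trials: Trials) -> float:
    """Fraction of tasks solved in every one of the k trials (Pass^k)."""
    return sum(all(t) for t in trials) / len(trials)


# Hypothetical outcomes: 3 tasks, 4 trials each (True = task succeeded).
outcomes = [
    [True, True, True, True],      # always solved
    [True, False, False, True],    # solved in half the trials
    [False, False, False, False],  # never solved
]

print(f"Avg@4  = {avg_at_k(outcomes):.3f}")
print(f"Pass@4 = {pass_at_k(outcomes):.3f}")
print(f"Pass^4 = {pass_hat_k(outcomes):.3f}")
```

Under these assumed definitions Pass^4 can never exceed Avg@4, and Avg@4 can never exceed Pass@4, so the three columns for one scenario bracket how consistent a model is across repeated runs.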

Showing 2 latest source slices.

Latest Results

Provider-published Qwen3.7-Max comparison scores. Rows are marked self-reported and should be interpreted as source claims unless independently reproduced.

| Rank | Subject | Cross-Scenarios Avg@4 | Model Match | Model ID | Provenance | Sampled |
|---|---|---|---|---|---|---|
| 1 | DeepSeek V4 Pro Max | 51.9% | DeepSeek V4 Pro | deepseek-deepseek-v4-pro | Self-reported | 2026-05-28 |
| 2 | Qwen3.7 Max | 47.9% | Qwen3.7 Max | qwen-qwen3.7-max | Self-reported | 2026-05-28 |
| 3 | GLM-5.1 Thinking | 45.1% | GLM 5.1 | z-ai-glm-5.1 | Self-reported | 2026-05-28 |
| 4 | Qwen3.6 Plus | 42.8% | Qwen3.6 Plus | qwen-qwen3.6-plus | Self-reported | 2026-05-28 |
| 5 | Kimi K2.6 Thinking | 39.1% | MoonshotAI: Kimi K2.6 | moonshotai-kimi-k2.6 | Self-reported | 2026-05-28 |

Imported results (sampled 2026-05-06).

| Rank | Subject | Cross-Scenarios Avg@4 | Model Match | Model ID | Provenance | Sampled |
|---|---|---|---|---|---|---|
| 1 | Gemini-3-Flash (high) | 32.50 | Gemini 3 Flash Preview | google-gemini-3-flash-preview | Imported | 2026-05-06 |
| 2 | Gemini-3-Pro (high) | 31.50 | Gemini 3 | google-gemini-3 | Imported | 2026-05-06 |
| 3 | LongCat-Flash-Thinking-2601 | 29.30 | | | Imported | 2026-05-06 |
| 4 | Claude-4.5-Opus | 28.50 | | | Imported | 2026-05-06 |
| 5 | o3 (high) | 26.30 | o3 | openai-o3 | Imported | 2026-05-06 |
| 6 | GPT-5.2 (xhigh) | 24.30 | GPT-5.2 | openai-gpt-5.2 | Imported | 2026-05-06 |
| 7 | DeepSeek-V3.2 | 24 | DeepSeek V3.2 | deepseek-deepseek-v3.2 | Imported | 2026-05-06 |
| 8 | Claude-4.5-Sonnet | 23.50 | | | Imported | 2026-05-06 |
| 9 | o4-mini (high) | 19.50 | o4 Mini | openai-o4-mini | Imported | 2026-05-06 |
| 10 | Doubao-Seed-1.8-Thinking (high) | 18.80 | | | Imported | 2026-05-06 |
| 11 | GLM-4.7 | 18.30 | GLM 4.7 | z-ai-glm-4.7 | Imported | 2026-05-06 |
| 12 | Qwen3-235B-A22B-Thinking-2507 | 14.50 | Qwen3 235B A22B Thinking 2507 | qwen-qwen3-235b-a22b-thinking-2507 | Imported | 2026-05-06 |
| 13 | Kimi-K2-Thinking | 12.80 | MoonshotAI: Kimi K2 Thinking | moonshotai-kimi-k2-thinking | Imported | 2026-05-06 |
| 14 | Qwen3-32B | 5.30 | Qwen3 32B | qwen-qwen3-32b | Imported | 2026-05-06 |
| 15 | Gemini-3-Pro (low) | 30 | Gemini 3 | google-gemini-3 | Imported | 2026-05-06 |
| 16 | Claude-4.5-Opus | 23.30 | | | Imported | 2026-05-06 |
| 17 | LongCat-Flash-Chat | 22.80 | | | Imported | 2026-05-06 |
| 18 | DeepSeek-V3.2 | 18.50 | DeepSeek V3.2 | deepseek-deepseek-v3.2 | Imported | 2026-05-06 |
| 19 | Claude-4.5-Sonnet | 17 | | | Imported | 2026-05-06 |
| 20 | GLM-4.7 | 15.50 | GLM 4.7 | z-ai-glm-4.7 | Imported | 2026-05-06 |
| 21 | Qwen3-Max | 14.30 | Qwen3 Max | qwen-qwen3-max | Imported | 2026-05-06 |
| 22 | Doubao-Seed-1.8 | 13.80 | | | Imported | 2026-05-06 |
| 23 | Qwen3-235B-A22B-Instruct-2507 | 12.30 | Qwen3 235B A22B Instruct 2507 | qwen-qwen3-235b-a22b-2507 | Imported | 2026-05-06 |
| 24 | Kimi-K2-0905 | 11.50 | MoonshotAI: Kimi K2 0905 | moonshotai-kimi-k2-0905 | Imported | 2026-05-06 |
| 25 | Qwen3-32B | 4 | Qwen3 32B | qwen-qwen3-32b | Imported | 2026-05-06 |
| 26 | GPT-5.2 (none) | 0.80 | GPT-5.2 | openai-gpt-5.2 | Imported | 2026-05-06 |
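
Because the results come from two source slices with different provenance and sampling dates, one way to avoid mixing source claims with imported scores is to rank within each slice separately. The sketch below illustrates that idea on a handful of rows copied from the results; the `Row` type and `rank_by_slice` helper are hypothetical, and it assumes both slices report Avg@4 on the same 0 to 100 scale.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    subject: str
    score: float      # Cross-Scenarios Avg@4, assumed to be on a 0-100 scale
    provenance: str   # "Self-reported" or "Imported"
    sampled: str      # ISO date the row was sampled


# A few rows copied from the results; not the full leaderboard.
rows = [
    Row("Gemini-3-Pro (high)", 31.50, "Imported", "2026-05-06"),
    Row("Qwen3.7 Max", 47.9, "Self-reported", "2026-05-28"),
    Row("o3 (high)", 26.30, "Imported", "2026-05-06"),
    Row("DeepSeek V4 Pro Max", 51.9, "Self-reported", "2026-05-28"),
    Row("Gemini-3-Flash (high)", 32.50, "Imported", "2026-05-06"),
]


def rank_by_slice(rows: list[Row]) -> dict[tuple[str, str], list[Row]]:
    """Group rows by (provenance, sampled) and sort each group by score, descending."""
    slices: dict[tuple[str, str], list[Row]] = defaultdict(list)
    for row in rows:
        slices[(row.provenance, row.sampled)].append(row)
    return {
        key: sorted(group, key=lambda r: r.score, reverse=True)
        for key, group in slices.items()
    }


ranked = rank_by_slice(rows)
for (provenance, sampled), group in ranked.items():
    print(f"{provenance} ({sampled})")
    for rank, row in enumerate(group, start=1):
        print(f"  {rank}. {row.subject}: {row.score}")
```

Keeping the slice key explicit makes it harder to read a self-reported score as directly comparable to an imported one.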