VitaBench
Interactive real-world applications benchmark for LLM agents across delivery, in-store consumption, online travel, and cross-scenario tasks with 66 tools and multi-turn user interactions.
Rows: 31
Primary metric: cross_scenarios_avg_at_4
Sampled: 2026-05-28
Metadata
Metrics
Cross-Scenarios Avg@4, Cross-Scenarios Pass@4, Cross-Scenarios Pass^4, Delivery Avg@4, Delivery Pass@4, Delivery Pass^4, In-store Avg@4, In-store Pass@4, In-store Pass^4, OTA Avg@4, OTA Pass@4, OTA Pass^4
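The three metric families can be illustrated with a minimal sketch. It assumes the usual reading of these names, which this page does not itself define: Avg@4 is the mean success rate over four independent trials per task, Pass@4 is the fraction of tasks solved in at least one of the four trials, and Pass^4 is the fraction solved in all four. The function name `summarize_trials` and the sample data are hypothetical, not part of any VitaBench tooling.

```python
from statistics import mean


def summarize_trials(results: list[list[bool]]) -> dict[str, float]:
    """Aggregate per-task trial outcomes into Avg@k, Pass@k and Pass^k.

    results[i][j] is True when trial j of task i succeeded.
    Assumed definitions (not confirmed by the leaderboard page):
      avg_at_k  - mean per-task success rate across the k trials
      pass_at_k - share of tasks with at least one successful trial
      pass_hat_k - share of tasks where every trial succeeded
    """
    avg_at_k = mean(
        mean(1.0 if ok else 0.0 for ok in trials) for trials in results
    )
    pass_at_k = mean(1.0 if any(trials) else 0.0 for trials in results)
    pass_hat_k = mean(1.0 if all(trials) else 0.0 for trials in results)
    return {
        "avg_at_k": avg_at_k,
        "pass_at_k": pass_at_k,
        "pass_hat_k": pass_hat_k,
    }


# Hypothetical outcomes for four tasks, four trials each.
sample = [
    [True, True, True, True],      # solved every time
    [True, False, False, False],   # solved once
    [False, False, False, False],  # never solved
    [True, True, False, False],    # solved twice
]
scores = summarize_trials(sample)
print(scores)
```

Under these assumed definitions Pass^4 can never exceed Avg@4, and Avg@4 can never exceed Pass@4, so the three columns for one scenario bracket how consistent a model is across repeated runs.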
Showing 2 latest source slices.
| Rank | Subject | Cross-Scenarios Avg@4 | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | DeepSeek V4 Pro Max | 51.9% | DeepSeek V4 Pro deepseek-deepseek-v4-pro | Self-reported | 2026-05-28 |
| 2 | Qwen3.7 Max | 47.9% | Qwen3.7 Max qwen-qwen3.7-max | Self-reported | 2026-05-28 |
| 3 | GLM-5.1 Thinking | 45.1% | GLM 5.1 z-ai-glm-5.1 | Self-reported | 2026-05-28 |
| 4 | Qwen3.6 Plus | 42.8% | Qwen3.6 Plus qwen-qwen3.6-plus | Self-reported | 2026-05-28 |
| 5 | Kimi K2.6 Thinking | 39.1% | MoonshotAI: Kimi K2.6 moonshotai-kimi-k2.6 | Self-reported | 2026-05-28 |
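
| Rank | Subject | Cross-Scenarios Avg@4 | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Gemini-3-Flash (high) | 32.50 | Gemini 3 Flash Preview google-gemini-3-flash-preview | Imported | 2026-05-06 |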
| 2 | Gemini-3-Pro (high) | 31.50 | Gemini 3 google-gemini-3 | Imported | 2026-05-06 |
| 3 | LongCat-Flash-Thinking-2601 | 29.30 | — | Imported | 2026-05-06 |
| 4 | Claude-4.5-Opus | 28.50 | — | Imported | 2026-05-06 |
| 5 | o3 (high) | 26.30 | o3 openai-o3 | Imported | 2026-05-06 |
| 6 | GPT-5.2 (xhigh) | 24.30 | GPT-5.2 openai-gpt-5.2 | Imported | 2026-05-06 |
| 7 | DeepSeek-V3.2 | 24 | DeepSeek V3.2 deepseek-deepseek-v3.2 | Imported | 2026-05-06 |
| 8 | Claude-4.5-Sonnet | 23.50 | — | Imported | 2026-05-06 |
| 9 | o4-mini (high) | 19.50 | o4 Mini openai-o4-mini | Imported | 2026-05-06 |
| 10 | Doubao-Seed-1.8-Thinking (high) | 18.80 | — | Imported | 2026-05-06 |
| 11 | GLM-4.7 | 18.30 | GLM 4.7 z-ai-glm-4.7 | Imported | 2026-05-06 |
| 12 | Qwen3-235B-A22B-Thinking-2507 | 14.50 | Qwen3 235B A22B Thinking 2507 qwen-qwen3-235b-a22b-thinking-2507 | Imported | 2026-05-06 |
| 13 | Kimi-K2-Thinking | 12.80 | MoonshotAI: Kimi K2 Thinking moonshotai-kimi-k2-thinking | Imported | 2026-05-06 |
| 14 | Qwen3-32B | 5.30 | Qwen3 32B qwen-qwen3-32b | Imported | 2026-05-06 |
| 15 | Gemini-3-Pro (low) | 30 | Gemini 3 google-gemini-3 | Imported | 2026-05-06 |
| 16 | Claude-4.5-Opus | 23.30 | — | Imported | 2026-05-06 |
| 17 | LongCat-Flash-Chat | 22.80 | — | Imported | 2026-05-06 |
| 18 | DeepSeek-V3.2 | 18.50 | DeepSeek V3.2 deepseek-deepseek-v3.2 | Imported | 2026-05-06 |
| 19 | Claude-4.5-Sonnet | 17 | — | Imported | 2026-05-06 |
| 20 | GLM-4.7 | 15.50 | GLM 4.7 z-ai-glm-4.7 | Imported | 2026-05-06 |
| 21 | Qwen3-Max | 14.30 | Qwen3 Max qwen-qwen3-max | Imported | 2026-05-06 |
| 22 | Doubao-Seed-1.8 | 13.80 | — | Imported | 2026-05-06 |
| 23 | Qwen3-235B-A22B-Instruct-2507 | 12.30 | Qwen3 235B A22B Instruct 2507 qwen-qwen3-235b-a22b-2507 | Imported | 2026-05-06 |
| 24 | Kimi-K2-0905 | 11.50 | MoonshotAI: Kimi K2 0905 moonshotai-kimi-k2-0905 | Imported | 2026-05-06 |
| 25 | Qwen3-32B | 4 | Qwen3 32B qwen-qwen3-32b | Imported | 2026-05-06 |
| 26 | GPT-5.2 (none) | 0.80 | GPT-5.2 openai-gpt-5.2 | Imported | 2026-05-06 |