BrowseComp-zh
A high-difficulty benchmark purpose-built to comprehensively evaluate LLM agents on the Chinese web, consisting of 289 multi-hop questions spanning 11 diverse domains including Film & TV, Technology, Medicine, and History. Questions are reverse-engineered from short, objective, and easily verifiable answers, requiring sophisticated reasoning and information reconciliation beyond basic retrieval. The benchmark addresses linguistic, infrastructural, and censorship-related complexities in Chinese web environments.
Rows: 13
Primary metric: Score
Sampled: 2026-05-06
Metadata
Metrics: Score, Normalized Score
| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Qwen3.5-397B-A17B | 0.70 | Qwen3.5 397B A17B (`qwen-qwen3.5-397b-a17b`) | Self-reported | 2026-05-06 |
| 2 | Qwen3.5-122B-A10B | 0.70 | Qwen3.5-122B-A10B (`qwen-qwen3.5-122b-a10b`) | Self-reported | 2026-05-06 |
| 3 | Qwen3.5-35B-A3B | 0.69 | Qwen3.5-35B-A3B (`qwen-qwen3.5-35b-a3b`) | Self-reported | 2026-05-06 |
| 4 | LongCat-Flash-Thinking-2601 | 0.69 | — | Self-reported | 2026-05-06 |
| 5 | GLM-4.7 | 0.67 | GLM 4.7 (`z-ai-glm-4.7`) | Self-reported | 2026-05-06 |
| 6 | DeepSeek-V3.2 | 0.65 | DeepSeek V3.2 (`deepseek-deepseek-v3.2`) | Self-reported | 2026-05-06 |
| 6 | DeepSeek-V3.2 (Thinking) | 0.65 | R1 (`deepseek-r1`) | Self-reported | 2026-05-06 |
| 8 | Kimi K2-Thinking-0905 | 0.62 | MoonshotAI: Kimi K2 Thinking (`moonshotai-kimi-k2-thinking`) | Self-reported | 2026-05-06 |
| 9 | Qwen3.5-27B | 0.62 | Qwen3.5-27B (`qwen-qwen3.5-27b`) | Self-reported | 2026-05-06 |
| 10 | DeepSeek-V3.1 | 0.49 | DeepSeek V3.1 (`deepseek-deepseek-chat-v3.1`) | Self-reported | 2026-05-06 |
| 11 | MiniMax M2 | 0.48 | MiniMax M2 (`minimax-minimax-m2`) | Self-reported | 2026-05-06 |
| 12 | DeepSeek-V3.2-Exp | 0.48 | DeepSeek V3.2 Exp (`deepseek-deepseek-v3.2-exp`) | Self-reported | 2026-05-06 |
| 13 | DeepSeek-R1-0528 | 0.36 | R1 0528 (`deepseek-deepseek-r1-0528`) | Self-reported | 2026-05-06 |
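
As a rough reading aid, the sketch below converts a displayed score into an approximate number of correctly answered questions out of the 289 in the benchmark. It assumes the Score column is plain accuracy on a 0 to 1 scale and that displayed values are rounded to two decimals; the page confirms neither. The function names `approx_correct` and `rounding_band` are hypothetical and not part of any leaderboard API.

```python
import math

TOTAL_QUESTIONS = 289  # question count stated in the benchmark description


def approx_correct(score: float, total: int = TOTAL_QUESTIONS) -> int:
    """Approximate count of correct answers for a fractional score.

    Assumption: `score` is plain accuracy in [0, 1].
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    return round(score * total)


def rounding_band(score: float, total: int = TOTAL_QUESTIONS,
                  digits: int = 2) -> tuple[int, int]:
    """Range of correct-answer counts that would display as `score`.

    Assumption: the displayed score is rounded to `digits` decimals.
    """
    half_step = 0.5 * 10 ** -digits
    lowest = math.ceil((score - half_step) * total)
    highest = math.floor((score + half_step) * total)
    return lowest, highest


if __name__ == "__main__":
    print(approx_correct(0.70))   # 202
    print(rounding_band(0.70))    # (201, 203)
```

Under these assumptions, a displayed 0.70 is compatible with 201 to 203 correct answers, which is one possible reason two rows with the same displayed score can hold different ranks.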