BrowseComp-zh

BrowseComp-zh is a high-difficulty benchmark built to evaluate LLM agents on the Chinese web. It consists of 289 multi-hop questions spanning 11 domains, including Film & TV, Technology, Medicine, and History. Questions are reverse-engineered from short, objective, easily verifiable answers, so solving them requires multi-step reasoning and reconciling information across sources rather than basic retrieval. The benchmark is designed around the linguistic, infrastructural, and censorship-related complexities of the Chinese web.

13 rows
Primary metric: Score
Sampled: 2026-05-06

Metadata

Metrics

Score, Normalized Score

Latest Results

| Rank | Subject | Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Qwen3.5-397B-A17B | 0.70 | Qwen3.5 397B A17B (qwen-qwen3.5-397b-a17b) | Self-reported | 2026-05-06 |
| 2 | Qwen3.5-122B-A10B | 0.70 | Qwen3.5-122B-A10B (qwen-qwen3.5-122b-a10b) | Self-reported | 2026-05-06 |
| 3 | Qwen3.5-35B-A3B | 0.69 | Qwen3.5-35B-A3B (qwen-qwen3.5-35b-a3b) | Self-reported | 2026-05-06 |
| 4 | LongCat-Flash-Thinking-2601 | 0.69 | | Self-reported | 2026-05-06 |
| 5 | GLM-4.7 | 0.67 | GLM 4.7 (z-ai-glm-4.7) | Self-reported | 2026-05-06 |
| 6 | DeepSeek-V3.2 | 0.65 | DeepSeek V3.2 (deepseek-deepseek-v3.2) | Self-reported | 2026-05-06 |
| 6 | DeepSeek-V3.2 (Thinking) | 0.65 | R1 (deepseek-r1) | Self-reported | 2026-05-06 |
| 8 | Kimi K2-Thinking-0905 | 0.62 | MoonshotAI: Kimi K2 Thinking (moonshotai-kimi-k2-thinking) | Self-reported | 2026-05-06 |
| 9 | Qwen3.5-27B | 0.62 | Qwen3.5-27B (qwen-qwen3.5-27b) | Self-reported | 2026-05-06 |
| 10 | DeepSeek-V3.1 | 0.49 | DeepSeek V3.1 (deepseek-deepseek-chat-v3.1) | Self-reported | 2026-05-06 |
| 11 | MiniMax M2 | 0.48 | MiniMax M2 (minimax-minimax-m2) | Self-reported | 2026-05-06 |
| 12 | DeepSeek-V3.2-Exp | 0.48 | DeepSeek V3.2 Exp (deepseek-deepseek-v3.2-exp) | Self-reported | 2026-05-06 |
| 13 | DeepSeek-R1-0528 | 0.36 | R1 0528 (deepseek-deepseek-r1-0528) | Self-reported | 2026-05-06 |
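
To give the scores above a concrete scale, the sketch below converts a reported score into an approximate number of correctly answered questions. It assumes, and this page does not confirm it, that the score is the plain fraction of the 289 questions answered correctly. The names `TOTAL_QUESTIONS` and `approx_correct` are hypothetical and exist only for illustration.

```python
# Assumption (not stated on this page): "score" is the fraction of the
# benchmark's 289 questions that a model answered correctly.
TOTAL_QUESTIONS = 289


def approx_correct(score: float, total: int = TOTAL_QUESTIONS) -> int:
    """Estimate how many questions a two-decimal score corresponds to."""
    return round(score * total)


def question_weight(total: int = TOTAL_QUESTIONS) -> float:
    """Score contribution of a single question under the same assumption."""
    return 1 / total


if __name__ == "__main__":
    for subject, score in [("Qwen3.5-397B-A17B", 0.70), ("DeepSeek-R1-0528", 0.36)]:
        print(f"{subject}: about {approx_correct(score)} of {TOTAL_QUESTIONS}")
    print(f"one question is worth about {question_weight():.4f} of the score")
```

Under this assumption a single question is worth roughly 0.0035, which is smaller than the 0.01 resolution of the displayed scores. Two entries that show the same rounded score may therefore still differ by a question or two.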