BrowseComp-zh

A high-difficulty benchmark built to evaluate LLM agents on the Chinese web. It consists of 289 multi-hop questions spanning 11 domains, including Film & TV, Technology, Medicine, and History. Questions are reverse-engineered from short, objective, easily verifiable answers, so solving them requires multi-step reasoning and reconciling information across sources rather than basic retrieval. The benchmark is designed around the linguistic, infrastructural, and censorship-related complexities of Chinese web environments.

13 rows

Primary metric: score

Sampled: 2026-05-06

Metadata

ID: browsecomp_zh
Category: Search
Release: 2025-04-27

Metrics

Score, Normalized Score

Latest Results

Rank	Subject	Score	Model Match	Provenance	Sampled
1	Qwen3.5-397B-A17B	0.70	Qwen3.5 397B A17B qwen-qwen3.5-397b-a17b	Self-reported	2026-05-06
2	Qwen3.5-122B-A10B	0.70	Qwen3.5-122B-A10B qwen-qwen3.5-122b-a10b	Self-reported	2026-05-06
3	Qwen3.5-35B-A3B	0.69	Qwen3.5-35B-A3B qwen-qwen3.5-35b-a3b	Self-reported	2026-05-06
4	LongCat-Flash-Thinking-2601	0.69	—	Self-reported	2026-05-06
5	GLM-4.7	0.67	GLM GLM 4.7 z-ai-glm-4.7	Self-reported	2026-05-06
6	DeepSeek-V3.2	0.65	DeepSeek V3.2 deepseek-deepseek-v3.2	Self-reported	2026-05-06
6	DeepSeek-V3.2 (Thinking)	0.65	R1 deepseek-r1	Self-reported	2026-05-06
8	Kimi K2-Thinking-0905	0.62	KIMI MoonshotAI: Kimi K2 Thinking moonshotai-kimi-k2-thinking	Self-reported	2026-05-06
9	Qwen3.5-27B	0.62	Qwen3.5-27B qwen-qwen3.5-27b	Self-reported	2026-05-06
10	DeepSeek-V3.1	0.49	DeepSeek V3.1 deepseek-deepseek-chat-v3.1	Self-reported	2026-05-06
11	MiniMax M2	0.48	MiniMax M2 minimax-minimax-m2	Self-reported	2026-05-06
12	DeepSeek-V3.2-Exp	0.48	DeepSeek V3.2 Exp deepseek-deepseek-v3.2-exp	Self-reported	2026-05-06
13	DeepSeek-R1-0528	0.36	R1 0528 deepseek-deepseek-r1-0528	Self-reported	2026-05-06
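
The page does not say how ranks are assigned when scores tie. As a minimal sketch, assuming standard competition ranking ("1, 1, 3") applied to the two-decimal scores shown in the table, ties could be resolved as below. The site may rank on unrounded values instead, which would explain why only the 0.65 pair shares a rank in the table. `ROWS` and `competition_rank` are illustrative names, not part of the site.

```python
from typing import List, Tuple

# Subject and displayed (rounded) score, copied from the table above.
ROWS: List[Tuple[str, float]] = [
    ("Qwen3.5-397B-A17B", 0.70),
    ("Qwen3.5-122B-A10B", 0.70),
    ("Qwen3.5-35B-A3B", 0.69),
    ("LongCat-Flash-Thinking-2601", 0.69),
    ("GLM-4.7", 0.67),
    ("DeepSeek-V3.2", 0.65),
    ("DeepSeek-V3.2 (Thinking)", 0.65),
    ("Kimi K2-Thinking-0905", 0.62),
    ("Qwen3.5-27B", 0.62),
    ("DeepSeek-V3.1", 0.49),
    ("MiniMax M2", 0.48),
    ("DeepSeek-V3.2-Exp", 0.48),
    ("DeepSeek-R1-0528", 0.36),
]


def competition_rank(rows: List[Tuple[str, float]]) -> List[Tuple[int, str, float]]:
    """Rank rows by score, highest first; tied scores share a rank and the next rank skips."""
    # sorted() is stable, so tied rows keep their original table order.
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    ranked: List[Tuple[int, str, float]] = []
    prev_score = None
    prev_rank = 0
    for position, (subject, score) in enumerate(ordered, start=1):
        # Reuse the previous rank on a tie, otherwise use the 1-based position.
        rank = prev_rank if score == prev_score else position
        ranked.append((rank, subject, score))
        prev_score, prev_rank = score, rank
    return ranked


if __name__ == "__main__":
    for rank, subject, score in competition_rank(ROWS):
        print(f"{rank}\t{subject}\t{score:.2f}")
```

On the rounded scores this yields more shared ranks than the table shows (for example, both 0.70 entries share rank 1), so it is best read as an illustration of tie handling rather than a reproduction of the site's ordering.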
