MCP-Universe
Benchmark that evaluates LLMs and agents on tasks run against real-world Model Context Protocol (MCP) servers, covering location navigation, repository management, finance, 3D design, browser automation, and web search.
28 rows
Primary metric: overall_success_rate
Sampled: 2026-05-06
Metadata
Metrics
Overall Success Rate, Location Navigation, Repository Management, Financial Analysis, 3D Designing, Browser Automation, Web Searching, Average Evaluator Score, Average Steps (lower is better)
| Rank | Subject | Overall Success Rate (%) | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | GPT-5-High | 44.16 | GPT-5 (`openai-gpt-5`) | Imported | 2026-05-06 |
| 2 | GPT-5-Medium | 43.72 | GPT-5 (`openai-gpt-5`) | Imported | 2026-05-06 |
| 3 | Grok-4 | 33.33 | Grok 4 (`x-ai-grok-4`) | Imported | 2026-05-06 |
| 4 | Claude-4.0-Sonnet-Thinking | 30.30 | — | Imported | 2026-05-06 |
| 5 | Claude-4.1-Opus | 29.44 | — | Imported | 2026-05-06 |
| 6 | Claude-4.0-Sonnet | 29.44 | — | Imported | 2026-05-06 |
| 7 | Claude-4.0-Opus | 28.14 | — | Imported | 2026-05-06 |
| 8 | Grok-4-Fast | 27.27 | Grok 4 Fast (`x-ai-grok-4-fast`) | Imported | 2026-05-06 |
| 9 | Grok-Code-Fast-1 | 26.41 | Grok Code Fast 1 (`x-ai-grok-code-fast-1`) | Imported | 2026-05-06 |
| 10 | o3-Medium | 26.41 | — | Imported | 2026-05-06 |
| 11 | o4-mini-Medium | 25.97 | — | Imported | 2026-05-06 |
| 12 | GLM-4.6 | 25.97 | GLM 4.6 (`z-ai-glm-4.6`) | Imported | 2026-05-06 |
| 13 | GLM-4.5 | 24.68 | GLM 4.5 (`z-ai-glm-4.5`) | Imported | 2026-05-06 |
| 14 | Claude-3.7-Sonnet | 24.24 | Claude 3.7 Sonnet (`anthropic-claude-3.7-sonnet`) | Imported | 2026-05-06 |
| 15 | Qwen3-Coder-480B-A35B-Instruct | 22.94 | Qwen3 Coder 480B A35B (`qwen-qwen3-coder`) | Imported | 2026-05-06 |
| 16 | Gemini-2.5-Pro | 22.08 | Gemini 2.5 Pro (`google-gemini-2.5-pro`) | Imported | 2026-05-06 |
| 17 | DeepSeek-V3.1 | 22.08 | DeepSeek V3.1 (`deepseek-deepseek-chat-v3.1`) | Imported | 2026-05-06 |
| 18 | Gemini-2.5-Flash | 21.65 | Gemini 2.5 Flash (`google-gemini-2.5-flash`) | Imported | 2026-05-06 |
| 19 | DeepSeek-V3.1-Terminus | 21.65 | DeepSeek V3.1 Terminus (`deepseek-deepseek-v3.1-terminus`) | Imported | 2026-05-06 |
| 20 | DeepSeek-V3.2-Exp | 19.91 | DeepSeek V3.2 Exp (`deepseek-deepseek-v3.2-exp`) | Imported | 2026-05-06 |
| 21 | Kimi-K2-0905 | 19.91 | MoonshotAI: Kimi K2 0905 (`moonshotai-kimi-k2-0905`) | Imported | 2026-05-06 |
| 22 | GLM-4.5-Air | 19.48 | GLM 4.5 Air (`z-ai-glm-4.5-air`) | Imported | 2026-05-06 |
| 23 | Kimi-K2-0711 | 19.05 | — | Imported | 2026-05-06 |
| 24 | GPT-4.1 | 18.18 | GPT-4.1 (`openai-gpt-4.1`) | Imported | 2026-05-06 |
| 25 | Qwen3-Max-Preview (Instruct) | 18.18 | — | Imported | 2026-05-06 |
| 26 | Qwen3-235B-A22B-Instruct-2507 | 18.18 | Qwen3 235B A22B Instruct 2507 (`qwen-qwen3-235b-a22b-2507`) | Imported | 2026-05-06 |
| 27 | GPT-4o-2024-08-06 | 15.58 | GPT-4o (2024-08-06) (`openai-gpt-4o-2024-08-06`) | Imported | 2026-05-06 |
| 28 | DeepSeek-V3 | 14.29 | DeepSeek V3 (`deepseek-deepseek-chat`) | Imported | 2026-05-06 |
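
The Rank column numbers rows sequentially, so subjects with identical scores (for example ranks 5 and 6, both at 29.44) still receive different ranks. The following is a minimal sketch, assuming a reader wants tied scores to share a rank ("competition" ranking). The helper `competition_ranks` is hypothetical and is not part of the benchmark or of this page's tooling; the scores are copied from the first seven rows of the table.

```python
# (subject, overall success rate) pairs copied from the top of the table.
rows = [
    ("GPT-5-High", 44.16),
    ("GPT-5-Medium", 43.72),
    ("Grok-4", 33.33),
    ("Claude-4.0-Sonnet-Thinking", 30.30),
    ("Claude-4.1-Opus", 29.44),
    ("Claude-4.0-Sonnet", 29.44),
    ("Claude-4.0-Opus", 28.14),
]


def competition_ranks(scored):
    """Hypothetical helper: sort by score (descending) and let ties share a rank.

    Returns a list of (rank, subject, score) tuples. After a tie, the next
    distinct score resumes at its sequential position (1, 2, 2, 4, ...).
    """
    # sorted() is stable, so tied subjects keep their original relative order.
    ordered = sorted(scored, key=lambda row: row[1], reverse=True)
    ranked = []
    for position, (subject, score) in enumerate(ordered, start=1):
        if ranked and ranked[-1][2] == score:
            rank = ranked[-1][0]  # same score as the previous row: reuse its rank
        else:
            rank = position
        ranked.append((rank, subject, score))
    return ranked


ranked = competition_ranks(rows)
for rank, subject, score in ranked:
    print(f"{rank:>2}  {subject:<28} {score:.2f}")
```

Under this assumed scheme, Claude-4.1-Opus and Claude-4.0-Sonnet would both be ranked 5 and Claude-4.0-Opus would remain at 7. The comparison uses exact equality, which is adequate here because the scores are published to two decimal places.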