MCPMark
MCP-based agent benchmark evaluating language-model agents across real-world tool environments including filesystem, GitHub, Notion, browser automation, and Postgres tasks.
45 rows
Primary metric: pass_at_1_avg
Sampled: 2026-05-28
Metadata
Metrics
Pass@1, Pass@1 std (lower is better), Pass@4, Pass^4, Avg agent execution time (lower is better), Avg turns (lower is better), Avg input tokens (lower is better), Avg output tokens (lower is better), Avg total tokens (lower is better), Per-run cost (lower is better), Filesystem Pass@1, Github Pass@1, Notion Pass@1, Playwright Pass@1, Postgres Pass@1
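
To make the headline metrics concrete, here is a minimal sketch of how Pass@1, its standard deviation, Pass@4, and Pass^4 could be derived from repeated runs. It assumes the common reading of these names: Pass@1 is the per-run success rate averaged over runs, Pass@k counts a task as solved if at least one of k runs succeeds, and Pass^k counts it only if all k runs succeed. The function name `summarize_runs`, the input layout, and the use of a population standard deviation are illustrative assumptions, not the benchmark's own code.

```python
from statistics import mean, pstdev


def summarize_runs(results):
    """Aggregate per-task run outcomes into leaderboard-style metrics.

    results: dict mapping a task id to a list of booleans, one per
    independent run. Every task is assumed to have the same number of runs k.
    """
    k = len(next(iter(results.values())))
    # Success rate of each run taken on its own (one Pass@1 value per run).
    per_run = [
        mean(1.0 if runs[i] else 0.0 for runs in results.values())
        for i in range(k)
    ]
    return {
        "pass_at_1_avg": mean(per_run),
        # Population std across runs; the sample std is an equally plausible choice.
        "pass_at_1_std": pstdev(per_run),
        # Solved in at least one of the k runs.
        "pass_at_k": mean(1.0 if any(runs) else 0.0 for runs in results.values()),
        # Solved in every one of the k runs.
        "pass_hat_k": mean(1.0 if all(runs) else 0.0 for runs in results.values()),
    }


if __name__ == "__main__":
    # Hypothetical outcomes for four tasks, four runs each.
    demo = {
        "task_a": [True, True, True, True],
        "task_b": [True, False, False, False],
        "task_c": [False, False, False, False],
        "task_d": [True, True, False, True],
    }
    print(summarize_runs(demo))
```

Under these definitions Pass^4 can never exceed the average Pass@1, and Pass@4 can never fall below it, which is why the three are usually read together as a measure of consistency.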
Showing 2 latest source slices.
| Rank | Subject | Pass@1 | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Qwen3.7 Max | 60.8% | Qwen3.7 Max qwen-qwen3.7-max | Self-reported | 2026-05-28 |
| 2 | GLM-5.1 Thinking | 57.5% | GLM 5.1 z-ai-glm-5.1 | Self-reported | 2026-05-28 |
| 3 | DeepSeek V4 Pro Max | 57.1% | DeepSeek V4 Pro deepseek-deepseek-v4-pro | Self-reported | 2026-05-28 |
| 4 | Claude Opus 4.6 Max | 56.7% | Claude Opus 4.6 anthropic-claude-opus-4.6 | Self-reported | 2026-05-28 |
| 5 | Kimi K2.6 Thinking | 55.9% | MoonshotAI: Kimi K2.6 moonshotai-kimi-k2.6 | Self-reported | 2026-05-28 |
| 6 | Qwen3.6 Plus | 48.2% | Qwen3.6 Plus qwen-qwen3.6-plus | Self-reported | 2026-05-28 |

| Rank | Subject | Pass@1 | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | gpt-5-2-high | 0.57 | GPT-5.2 openai-gpt-5.2 | Imported | 2026-05-06 |
| 2 | gemini-3-pro-high | 0.54 | Gemini 3 google-gemini-3 | Imported | 2026-05-06 |
| 3 | gpt-5-medium | 0.53 | GPT-5 openai-gpt-5 | Imported | 2026-05-06 |
| 4 | gpt-5-high | 0.52 | GPT-5 openai-gpt-5 | Imported | 2026-05-06 |
| 5 | gemini-3-pro-low | 0.51 | Gemini 3 google-gemini-3 | Imported | 2026-05-06 |
| 6 | gpt-5-low | 0.47 | GPT-5 openai-gpt-5 | Imported | 2026-05-06 |
| 7 | claude-opus-4-5-high | 0.42 | Claude Opus 4.5 anthropic-claude-opus-4.5 | Imported | 2026-05-06 |
| 8 | deepseek-v3-2-thinking | 0.37 | DeepSeek V3.2 deepseek-deepseek-v3.2 | Imported | 2026-05-06 |
| 9 | claude-sonnet-4-5 | 0.32 | Claude Sonnet 4.5 anthropic-claude-sonnet-4.5 | Imported | 2026-05-06 |
| 10 | grok-4 | 0.32 | Grok 4 x-ai-grok-4 | Imported | 2026-05-06 |
| 11 | gpt-5-mini-high | 0.30 | GPT-5 Mini openai-gpt-5-mini | Imported | 2026-05-06 |
| 12 | claude-opus-4-1 | 0.30 | Claude Opus 4.1 anthropic-claude-opus-4.1 | Imported | 2026-05-06 |
| 13 | deepseek-v3-2-chat | 0.30 | — | Imported | 2026-05-06 |
| 14 | claude-sonnet-4-high | 0.28 | — | Imported | 2026-05-06 |
| 15 | claude-sonnet-4 | 0.28 | Claude Sonnet 4 anthropic-claude-sonnet-4 | Imported | 2026-05-06 |
| 16 | claude-sonnet-4-low | 0.27 | — | Imported | 2026-05-06 |
| 17 | gpt-5-mini-medium | 0.27 | GPT-5 Mini openai-gpt-5-mini | Imported | 2026-05-06 |
| 18 | o3 | 0.25 | o3 openai-o3 | Imported | 2026-05-06 |
| 19 | qwen-3-coder-plus | 0.25 | Qwen3 Coder Plus qwen-qwen3-coder-plus | Imported | 2026-05-06 |
| 20 | grok-4-fast | 0.24 | Grok 4 Fast x-ai-grok-4-fast | Imported | 2026-05-06 |
| 21 | kimi-k2-0905 | 0.22 | MoonshotAI: Kimi K2 0905 moonshotai-kimi-k2-0905 | Imported | 2026-05-06 |
| 22 | deepseek-v3-1-terminus-thinking | 0.21 | — | Imported | 2026-05-06 |
| 23 | grok-code-fast-1 | 0.20 | Grok Code Fast 1 x-ai-grok-code-fast-1 | Imported | 2026-05-06 |
| 24 | kimi-k2-0711 | 0.19 | — | Imported | 2026-05-06 |
| 25 | qwen-3-max | 0.18 | Qwen3 Max qwen-qwen3-max | Imported | 2026-05-06 |
| 26 | o4-mini | 0.17 | o4 Mini openai-o4-mini | Imported | 2026-05-06 |
| 27 | deepseek-chat | 0.17 | DeepSeek V3 deepseek-deepseek-chat | Imported | 2026-05-06 |
| 28 | deepseek-v3-1-terminus | 0.17 | DeepSeek V3.1 Terminus deepseek-deepseek-v3.1-terminus | Imported | 2026-05-06 |
| 29 | gemini-2-5-pro | 0.16 | Gemini 2.5 Pro google-gemini-2.5-pro | Imported | 2026-05-06 |
| 30 | glm-4-5 | 0.16 | GLM 4.5 z-ai-glm-4.5 | Imported | 2026-05-06 |
| 31 | gemini-2-5-flash | 0.09 | Gemini 2.5 Flash google-gemini-2.5-flash | Imported | 2026-05-06 |
| 32 | gpt-5-mini-low | 0.08 | GPT-5 Mini openai-gpt-5-mini | Imported | 2026-05-06 |
| 33 | gpt-4-1 | 0.08 | GPT-4.1 openai-gpt-4.1 | Imported | 2026-05-06 |
| 34 | gpt-5-nano-medium | 0.06 | GPT-5 Nano openai-gpt-5-nano | Imported | 2026-05-06 |
| 35 | gpt-5-nano-high | 0.05 | GPT-5 Nano openai-gpt-5-nano | Imported | 2026-05-06 |
| 36 | gpt-oss-120b | 0.05 | gpt-oss-120b openai-gpt-oss-120b | Imported | 2026-05-06 |
| 37 | gpt-5-nano-low | 0.04 | GPT-5 Nano openai-gpt-5-nano | Imported | 2026-05-06 |
| 38 | gpt-4-1-mini | 0.04 | GPT-4.1 Mini openai-gpt-4.1-mini | Imported | 2026-05-06 |
| 39 | gpt-4-1-nano | 0 | GPT-4.1 Nano openai-gpt-4.1-nano | Imported | 2026-05-06 |
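
The two source slices report Pass@1 in different units: percentages in the 2026-05-28 rows and fractions in the 2026-05-06 rows. A small helper like the one below could put both on a common 0 to 1 scale before any side-by-side reading. The name `to_fraction` is hypothetical, and the sketch assumes every cell is either a plain number or a number followed by `%`.

```python
def to_fraction(cell: str) -> float:
    """Convert a Pass@1 cell such as '60.8%' or '0.57' to a fraction in [0, 1]."""
    text = cell.strip()
    if text.endswith("%"):
        # Percentage form, as used in the self-reported slice.
        return float(text[:-1]) / 100.0
    # Already a fraction, as used in the imported slice.
    return float(text)


if __name__ == "__main__":
    # Two cells copied from the table above, one from each slice.
    for subject, cell in [("Qwen3.7 Max", "60.8%"), ("gpt-5-2-high", "0.57")]:
        print(subject, to_fraction(cell))
```

Even after normalization, the slices differ in provenance (self-reported versus imported) and sampling date, so a shared scale does not by itself make the rows directly comparable.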