MCP-Bench
Benchmark for evaluating LLM agents on complex real-world tool-use tasks through MCP servers, covering schema understanding, LLM-judged task completion, tool usage, and planning effectiveness.
20 rows
Primary metric: overall_score
Sampled: 2026-05-06
Overall Score
| Rank | Subject | Overall Score | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | gpt-5 | 0.75 | — | Imported | 2026-05-06 |
| 2 | o3 | 0.71 | — | Imported | 2026-05-06 |
| 3 | gpt-oss-120b | 0.69 | — | Imported | 2026-05-06 |
| 4 | gemini-2.5-pro | 0.69 | — | Imported | 2026-05-06 |
| 5 | claude-sonnet-4 | 0.68 | — | Imported | 2026-05-06 |
| 6 | qwen3-235b-a22b-2507 | 0.68 | — | Imported | 2026-05-06 |
| 7 | glm-4.5 | 0.67 | — | Imported | 2026-05-06 |
| 8 | gpt-oss-20b | 0.65 | — | Imported | 2026-05-06 |
| 9 | kimi-k2 | 0.63 | — | Imported | 2026-05-06 |
| 10 | qwen3-30b-a3b-instruct-2507 | 0.63 | — | Imported | 2026-05-06 |
| 11 | gemini-2.5-flash-lite | 0.60 | — | Imported | 2026-05-06 |
| 12 | gpt-4o | 0.59 | — | Imported | 2026-05-06 |
| 13 | gemma-3-27b-it | 0.58 | — | Imported | 2026-05-06 |
| 14 | llama-3-3-70b-instruct | 0.56 | — | Imported | 2026-05-06 |
| 15 | gpt-4o-mini | 0.56 | — | Imported | 2026-05-06 |
| 16 | mistral-small-2503 | 0.53 | — | Imported | 2026-05-06 |
| 17 | llama-3-1-70b-instruct | 0.51 | — | Imported | 2026-05-06 |
| 18 | nova-micro-v1 | 0.51 | — | Imported | 2026-05-06 |
| 19 | llama-3-2-90b-vision-instruct | 0.49 | — | Imported | 2026-05-06 |
| 20 | llama-3-1-8b-instruct | 0.43 | — | Imported | 2026-05-06 |
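
Several rows in the table share the same displayed Overall Score (for example 0.69 at ranks 3 and 4) yet receive consecutive ranks. The source does not state how ties are ordered, so the following is only a minimal sketch of one way such ordinal ranks could be derived: a stable descending sort on the score, which keeps tied rows in their incoming order. The function name `ordinal_ranks` and the tie-break rule are assumptions for illustration, and the displayed scores may be rounded.

```python
# A few rows copied from the table above: (subject, overall_score), in listed order.
rows = [
    ("gpt-5", 0.75),
    ("o3", 0.71),
    ("gpt-oss-120b", 0.69),
    ("gemini-2.5-pro", 0.69),
    ("claude-sonnet-4", 0.68),
]


def ordinal_ranks(rows):
    """Assign ranks 1..n by descending score (hypothetical helper).

    Python's sort is stable, also with reverse=True, so rows with equal
    scores keep their original relative order. That tie-break is an
    assumption, not something the leaderboard documents.
    """
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    return [(rank, name, score) for rank, (name, score) in enumerate(ordered, start=1)]


ranked = ordinal_ranks(rows)
for rank, name, score in ranked:
    print(f"{rank:>2}  {name:<20} {score:.2f}")
```

If tied models should share a rank instead, a competition-style ranking would be the alternative; the table above does not appear to use it.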