MCP-Bench

Benchmark for evaluating LLM agents on complex real-world tool-use tasks through MCP servers, covering schema understanding, LLM-judged task completion, tool usage, and planning effectiveness.

Rows: 20
Primary metric: overall_score
Sampled: 2026-05-06

Metadata

Metrics

Overall Score

Latest Results

Rows are parsed from the public Accenture MCP-Bench README leaderboard table. Scores are averaged across single-server and multi-server settings per the source README.

| Rank | Subject | Overall Score | Model Match | Provenance | Sampled |
|------|---------|---------------|-------------|------------|---------|
| 1 | gpt-5 | 0.75 | | Imported | 2026-05-06 |
| 2 | o3 | 0.71 | | Imported | 2026-05-06 |
| 3 | gpt-oss-120b | 0.69 | | Imported | 2026-05-06 |
| 4 | gemini-2.5-pro | 0.69 | | Imported | 2026-05-06 |
| 5 | claude-sonnet-4 | 0.68 | | Imported | 2026-05-06 |
| 6 | qwen3-235b-a22b-2507 | 0.68 | | Imported | 2026-05-06 |
| 7 | glm-4.5 | 0.67 | | Imported | 2026-05-06 |
| 8 | gpt-oss-20b | 0.65 | | Imported | 2026-05-06 |
| 9 | kimi-k2 | 0.63 | | Imported | 2026-05-06 |
| 10 | qwen3-30b-a3b-instruct-2507 | 0.63 | | Imported | 2026-05-06 |
| 11 | gemini-2.5-flash-lite | 0.60 | | Imported | 2026-05-06 |
| 12 | gpt-4o | 0.59 | | Imported | 2026-05-06 |
| 13 | gemma-3-27b-it | 0.58 | | Imported | 2026-05-06 |
| 14 | llama-3-3-70b-instruct | 0.56 | | Imported | 2026-05-06 |
| 15 | gpt-4o-mini | 0.56 | | Imported | 2026-05-06 |
| 16 | mistral-small-2503 | 0.53 | | Imported | 2026-05-06 |
| 17 | llama-3-1-70b-instruct | 0.51 | | Imported | 2026-05-06 |
| 18 | nova-micro-v1 | 0.51 | | Imported | 2026-05-06 |
| 19 | llama-3-2-90b-vision-instruct | 0.49 | | Imported | 2026-05-06 |
| 20 | llama-3-1-8b-instruct | 0.43 | | Imported | 2026-05-06 |
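
The note above the table says each Overall Score is an average across the single-server and multi-server settings. A minimal sketch of that arithmetic follows, assuming an unweighted mean of the two settings rounded to two decimals; the source README's exact aggregation may differ. The function name `overall_score`, the model names, and the per-setting values are hypothetical and are not taken from the leaderboard.

```python
from statistics import mean


def overall_score(single_server: float, multi_server: float) -> float:
    """Assumed aggregation: unweighted mean of both settings, two decimals."""
    return round(mean([single_server, multi_server]), 2)


# Hypothetical per-setting scores for illustration only (not leaderboard data).
example = {
    "model-a": (0.70, 0.60),
    "model-b": (0.50, 0.58),
}

# Rank subjects by the averaged score, highest first.
ranked = sorted(
    ((name, overall_score(single, multi)) for name, (single, multi) in example.items()),
    key=lambda item: item[1],
    reverse=True,
)

for rank, (name, score) in enumerate(ranked, start=1):
    print(rank, name, f"{score:.2f}")
```

Because the published scores are rounded to two decimals, subjects that share a displayed score (for example, the two entries at 0.69) may still differ in the underlying averages, which would explain their distinct ranks.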