GitTaskBench

Repository-level code-agent benchmark covering real GitHub tasks, reporting task pass rate, execution completion rate, token usage, and cost.

Rows: 28
Primary metric: task_pass_rate
Sampled: 2026-05-27

Metadata

Metrics

Task Pass Rate, Execution Completion Rate, Input Tokens (lower is better), Output Tokens (lower is better), Cost (lower is better)

Latest Results

Rows parsed from GitTaskBench's public app bundle. GitTaskBench evaluates code agents on repository-level real-world tasks.

| Rank | Subject | Task Pass Rate | Model Match | Provenance | Sampled |
| --- | --- | --- | --- | --- | --- |
| 1 | RepoMaster + Claude 3.5 | 62.96 | | Imported | 2026-05-27 |
| 2 | OpenHands + Claude 3.7 | 48.15 | | Imported | 2026-05-27 |
| 3 | RepoMaster + DeepSeekV3 | 44.44 | | Imported | 2026-05-27 |
| 4 | SWE-Agent + Claude 3.7 | 42.59 | | Imported | 2026-05-27 |
| 5 | OpenHands + GPT-4.1 | 42.59 | | Imported | 2026-05-27 |
| 6 | OpenHands + Claude 3.5 | 40.74 | | Imported | 2026-05-27 |
| 7 | RepoMaster + GPT-4o | 40.74 | | Imported | 2026-05-27 |
| 8 | OpenHands + Gemini-2.5-pro | 35.19 | | Imported | 2026-05-27 |
| 9 | SWE-Agent + GPT-4.1 | 31.48 | | Imported | 2026-05-27 |
| 10 | OpenHands + Qwen3-32b* | 29.63 | | Imported | 2026-05-27 |
| 11 | OpenHands + DeepSeekV3 | 26.85 | | Imported | 2026-05-27 |
| 12 | OpenHands + Qwen3-32b* | 25.93 | | Imported | 2026-05-27 |
| 13 | SWE-Agent + Claude 3.5 | 22.23 | | Imported | 2026-05-27 |
| 14 | OpenHands + o3-mini | 22.22 | | Imported | 2026-05-27 |
| 15 | SWE-Agent + o3-mini | 20.37 | | Imported | 2026-05-27 |
| 16 | OpenHands + Llama3.3-70b* | 20.37 | | Imported | 2026-05-27 |
| 17 | SWE-Agent + Llama3.3-70b* | 18.52 | | Imported | 2026-05-27 |
| 18 | Aider + DeepSeekV3 | 16.67 | | Imported | 2026-05-27 |
| 19 | OpenHands + GPT-4o | 14.82 | | Imported | 2026-05-27 |
| 20 | Aider + Claude 3.5 | 12.96 | | Imported | 2026-05-27 |
| 21 | SWE-Agent + DeepSeekV3 | 12.04 | | Imported | 2026-05-27 |
| 22 | SWE-Agent + Qwen3-32b* | 11.11 | | Imported | 2026-05-27 |
| 23 | SWE-Agent + GPT-4o | 10.19 | | Imported | 2026-05-27 |
| 24 | Aider + GPT-4.1 | 7.41 | | Imported | 2026-05-27 |
| 25 | OpenHands + Qwen3-14b* | 5.56 | | Imported | 2026-05-27 |
| 26 | SWE-Agent + Qwen3-32b* | 3.7 | | Imported | 2026-05-27 |
| 27 | Aider + GPT-4o | 1.85 | | Imported | 2026-05-27 |
| 28 | OpenHands + Qwen3-8b* | 1.85 | | Imported | 2026-05-27 |
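
Each Subject in the rows above pairs an agent framework with a model, joined by " + ". The sketch below shows one way to split that field and find the top-scoring model for each framework. It is illustrative only: the names `ROWS`, `split_subject`, and `best_per_framework` are hypothetical and not part of GitTaskBench, and only a handful of rows are copied in for brevity.

```python
# A few (subject, task_pass_rate) pairs copied from the results above.
# ROWS is a hypothetical name for this example, not a GitTaskBench export format.
ROWS = [
    ("RepoMaster + Claude 3.5", 62.96),
    ("OpenHands + Claude 3.7", 48.15),
    ("RepoMaster + DeepSeekV3", 44.44),
    ("SWE-Agent + Claude 3.7", 42.59),
    ("OpenHands + GPT-4.1", 42.59),
    ("Aider + DeepSeekV3", 16.67),
]


def split_subject(subject):
    """Split 'Framework + Model' into a (framework, model) tuple."""
    framework, model = subject.split(" + ", 1)
    return framework, model


def best_per_framework(rows):
    """Return {framework: (model, pass_rate)} keeping the highest pass rate."""
    best = {}
    for subject, rate in rows:
        framework, model = split_subject(subject)
        # Higher is better for task pass rate; the first row wins on ties.
        if framework not in best or rate > best[framework][1]:
            best[framework] = (model, rate)
    return best


if __name__ == "__main__":
    for framework, (model, rate) in best_per_framework(ROWS).items():
        print(f"{framework}: {model} ({rate})")
```

The same grouping could be keyed on the model instead of the framework to compare how one model fares across agents.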