SWE-Gym

SWE-Gym evaluates software-engineering agents on realistic issue resolution, repository navigation, testing, and maintenance workflows.

Rows: 10
Primary metric: success_rate
Sampled: 2026-05-27

Metadata

Metrics

Success rate, Success trajectories, Max turns (lower is better), Sampling temperature (lower is better)

Latest Results

Rows are transcribed from the sampled-trajectories table in the SWE-Gym arXiv source. The primary score is the percentage of sampled trajectories that are successful on SWE-Gym Lite or Full.
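As a rough illustration of the primary metric, the sketch below computes a success percentage from a count of successful trajectories and a total. It assumes the score is simply successes divided by total; the function name `success_rate` and the counts in the usage line are hypothetical and are not taken from the table.

```python
def success_rate(successes: int, total: int) -> float:
    """Return the share of successful trajectories as a percentage.

    Assumption: the score is successes / total, rounded to two decimals.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= successes <= total:
        raise ValueError("successes must be between 0 and total")
    return round(100 * successes / total, 2)


# Illustrative counts only (not from the source table).
example = success_rate(19, 200)
print(f"{example}%")
```

Rounding to two decimals mirrors the precision shown in the results, though the source does not state how its percentages were rounded.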

Rank | Subject | Success rate | Model match | Provenance | Sampled
1 | claude-3-5-sonnet-20241022 on SWE-Gym Lite (t=0, max_turns=50) | 29.1% | | Imported | 2026-05-27
2 | gpt-4o-2024-08-06 on SWE-Gym Lite (t=0.4, max_turns=30) | 9.13% | GPT-4o (openai-gpt-4o) | Imported | 2026-05-27
3 | gpt-4o-2024-08-06 on SWE-Gym Lite (t=0.8, max_turns=30) | 8.7% | GPT-4o (openai-gpt-4o) | Imported | 2026-05-27
4 | gpt-4o-2024-08-06 on SWE-Gym Lite (t=0, max_turns=30) | 8.26% | GPT-4o (openai-gpt-4o) | Imported | 2026-05-27
5 | gpt-4o-2024-08-06 on SWE-Gym Lite (t=0, max_turns=50) | 8.26% | GPT-4o (openai-gpt-4o) | Imported | 2026-05-27
6 | gpt-4o-2024-08-06 on SWE-Gym Lite (t=0.5, max_turns=30) | 7.83% | GPT-4o (openai-gpt-4o) | Imported | 2026-05-27
7 | gpt-4o-2024-08-06 on SWE-Gym Full (t=1, max_turns=50) | 7.71% | GPT-4o (openai-gpt-4o) | Imported | 2026-05-27
8 | gpt-4o-2024-08-06 on SWE-Gym Lite (t=0.3, max_turns=30) | 7.39% | GPT-4o (openai-gpt-4o) | Imported | 2026-05-27
9 | gpt-4o-2024-08-06 on SWE-Gym Lite (t=0.2, max_turns=30) | 4.78% | GPT-4o (openai-gpt-4o) | Imported | 2026-05-27
10 | gpt-4o-2024-08-06 on SWE-Gym Full (t=0, max_turns=50) | 4.55% | GPT-4o (openai-gpt-4o) | Imported | 2026-05-27
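The ranking above can be reproduced with a short script. The sketch below holds the ten rows as records and sorts them by success rate in descending order. The `Row` type, its field names, and the `rank` helper are illustrative assumptions; the page does not describe how its ranks are computed, including how the two tied 8.26% rows are ordered. Here, ties keep their input order because Python's sort is stable.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    split: str          # "Lite" or "Full"
    temperature: float
    max_turns: int
    success_rate: float  # percent, as listed in the table


# Values copied from the results table.
ROWS = [
    Row("claude-3-5-sonnet-20241022", "Lite", 0.0, 50, 29.1),
    Row("gpt-4o-2024-08-06", "Lite", 0.4, 30, 9.13),
    Row("gpt-4o-2024-08-06", "Lite", 0.8, 30, 8.7),
    Row("gpt-4o-2024-08-06", "Lite", 0.0, 30, 8.26),
    Row("gpt-4o-2024-08-06", "Lite", 0.0, 50, 8.26),
    Row("gpt-4o-2024-08-06", "Lite", 0.5, 30, 7.83),
    Row("gpt-4o-2024-08-06", "Full", 1.0, 50, 7.71),
    Row("gpt-4o-2024-08-06", "Lite", 0.3, 30, 7.39),
    Row("gpt-4o-2024-08-06", "Lite", 0.2, 30, 4.78),
    Row("gpt-4o-2024-08-06", "Full", 0.0, 50, 4.55),
]


def rank(rows: list[Row]) -> list[tuple[int, Row]]:
    """Pair each row with a 1-based rank, highest success rate first."""
    ordered = sorted(rows, key=lambda r: r.success_rate, reverse=True)
    return list(enumerate(ordered, start=1))


for position, row in rank(ROWS):
    print(
        f"{position:>2}  {row.model} on SWE-Gym {row.split} "
        f"(t={row.temperature:g}, max_turns={row.max_turns})  {row.success_rate}%"
    )
```

Keeping temperature and max turns as separate fields, rather than inside one subject string, makes it straightforward to filter by split or compare settings for a single model.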