ContextBench

Benchmark for context retrieval in coding agents, measuring how well agents retrieve and use multi-file code context before producing fixes.

Rows: 4
Primary metric: pass_at_1
Sampled: 2026-05-06

Metadata

Metrics

Pass@1, Context F1, Efficiency, Avg. Cost (lower is better)
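As a rough illustration of two of the metrics listed above, the sketch below computes a pass rate over single attempts and an F1 score between retrieved and reference context. It assumes context items are compared as sets of file paths and that Pass@1 is reported as a percentage. Both are assumptions made for this example, not the benchmark's documented scoring procedure, and the function names `pass_at_1` and `context_f1` are hypothetical.

```python
def pass_at_1(outcomes: list[bool]) -> float:
    """Percentage of tasks whose single attempt passed (assumed definition)."""
    if not outcomes:
        return 0.0
    return 100.0 * sum(outcomes) / len(outcomes)


def context_f1(retrieved: set[str], gold: set[str]) -> float:
    """Harmonic mean of precision and recall over context items.

    Items are treated as opaque strings (file paths here); the real
    benchmark may score at a different granularity.
    """
    overlap = len(retrieved & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(retrieved)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical task: the agent retrieved two files, one of which was needed.
    print(context_f1({"a.py", "b.py"}, {"b.py", "c.py"}))
    print(pass_at_1([True, False, True, True]))
```

Using sets keeps the overlap computation order-independent; a line-level or block-level variant would only change what the strings represent.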

Latest Results

Rows are parsed from the public ContextBench leaderboard table. Source system display names are preserved.

Rank  Subject            Pass@1  Model Match                                      Provenance  Sampled
1     Claude Sonnet 4.5  53      Claude Sonnet 4.5 (anthropic-claude-sonnet-4.5)  Imported    2026-05-06
2     GPT-5              47.20   GPT-5 (openai-gpt-5)                             Imported    2026-05-06
3     Devstral 2         40.20   -                                                Imported    2026-05-06
4     Gemini 2.5 Pro     36.40   Gemini 2.5 Pro (google-gemini-2.5-pro)           Imported    2026-05-06
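For readers who want to work with these rows programmatically, here is a minimal sketch that holds them as typed records and checks two properties visible in the table: ranks follow descending Pass@1, and one subject has no model match. The `LeaderboardRow` class, its field names, and the helper functions are hypothetical, chosen for this example; they are not a schema published by the leaderboard.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LeaderboardRow:
    rank: int
    subject: str                # display name as shown on the source leaderboard
    pass_at_1: float
    model_id: Optional[str]     # None when no model match is listed
    provenance: str
    sampled: str                # ISO date string


# Values copied from the rows above.
ROWS = [
    LeaderboardRow(1, "Claude Sonnet 4.5", 53.0, "anthropic-claude-sonnet-4.5", "Imported", "2026-05-06"),
    LeaderboardRow(2, "GPT-5", 47.2, "openai-gpt-5", "Imported", "2026-05-06"),
    LeaderboardRow(3, "Devstral 2", 40.2, None, "Imported", "2026-05-06"),
    LeaderboardRow(4, "Gemini 2.5 Pro", 36.4, "google-gemini-2.5-pro", "Imported", "2026-05-06"),
]


def ranks_are_consistent(rows: list[LeaderboardRow]) -> bool:
    """True if rank 1..N matches the order of descending Pass@1."""
    ordered = sorted(rows, key=lambda r: r.pass_at_1, reverse=True)
    return [r.rank for r in ordered] == list(range(1, len(rows) + 1))


def unmatched_subjects(rows: list[LeaderboardRow]) -> list[str]:
    """Display names of rows that have no model match."""
    return [r.subject for r in rows if r.model_id is None]


if __name__ == "__main__":
    print(ranks_are_consistent(ROWS))
    print(unmatched_subjects(ROWS))
```

Keeping `model_id` optional avoids inventing an identifier for the row that lists none.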