ContextBench
Benchmark for context retrieval in coding agents, measuring how well agents retrieve and use multi-file code context before producing fixes.
Rows: 4
Primary metric: pass_at_1
Sampled: 2026-05-06
Metadata
Metrics: Pass@1, Context F1, Efficiency, Avg. Cost (lower is better)
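
This page names Context F1 but does not define it. One plausible reading is a set-based F1 between the context items an agent retrieved and the items actually needed for the fix. The sketch below assumes that reading; the function name `context_f1` and the file paths are hypothetical and are not taken from the benchmark.

```python
def context_f1(retrieved: set[str], gold: set[str]) -> float:
    """Set-based F1 between retrieved and gold context items.

    Assumed definition for illustration; the benchmark's actual
    Context F1 may be computed differently.
    """
    if not retrieved or not gold:
        return 0.0
    overlap = len(retrieved & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(retrieved)  # share of retrieved items that were needed
    recall = overlap / len(gold)          # share of needed items that were retrieved
    return 2 * precision * recall / (precision + recall)


# Hypothetical file paths, for illustration only.
retrieved = {"src/parser.py", "src/lexer.py", "tests/test_parser.py"}
gold = {"src/parser.py", "src/ast.py"}
score = context_f1(retrieved, gold)
print(f"Context F1: {score:.2f}")
```

Under this reading, an agent that retrieves many irrelevant files loses precision, and one that misses needed files loses recall.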
| Rank | Subject | Pass@1 (%) | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Claude Sonnet 4.5 | 53.00 | Claude Sonnet 4.5 (`anthropic-claude-sonnet-4.5`) | Imported | 2026-05-06 |
| 2 | GPT-5 | 47.20 | GPT-5 (`openai-gpt-5`) | Imported | 2026-05-06 |
| 3 | Devstral 2 | 40.20 | — | Imported | 2026-05-06 |
| 4 | Gemini 2.5 Pro | 36.40 | Gemini 2.5 Pro (`google-gemini-2.5-pro`) | Imported | 2026-05-06 |
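
The ranking in the table follows the primary metric, pass_at_1, in descending order. As a minimal sketch, the same ordering can be reproduced from the four rows; the `Row` class and `rank` function are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    subject: str
    pass_at_1: float  # Pass@1 score as listed in the table


# Scores copied from the table, deliberately unsorted.
rows = [
    Row("Gemini 2.5 Pro", 36.40),
    Row("Claude Sonnet 4.5", 53.0),
    Row("Devstral 2", 40.20),
    Row("GPT-5", 47.20),
]


def rank(rows: list[Row]) -> list[tuple[int, str, float]]:
    """Return (rank, subject, score) tuples, highest pass_at_1 first."""
    ordered = sorted(rows, key=lambda r: r.pass_at_1, reverse=True)
    return [(i, r.subject, r.pass_at_1) for i, r in enumerate(ordered, start=1)]


ranking = rank(rows)
for position, subject, score in ranking:
    print(f"{position}. {subject}: {score:.2f}")
```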