Continual Learning Bench
A benchmark of expert-validated tasks for agents that learn and improve across sequences of task instances rather than solving independent tasks from scratch.
12 rows
Primary metric: agg_reward
Sampled: 2026-05-04
Metadata
Metrics
Agg. Reward, Agg. Gain, Avg. Cost (lower is better)
| Rank | Subject | Agg. Reward | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | ICL - Claude Sonnet 4.6 | 0.22 | — | Imported | 2026-05-04 |
| 2 | ICL - GPT-5.4 | 0.20 | — | Imported | 2026-05-04 |
| 3 | Claude Code - Sonnet 4.6 | 0.19 | — | Imported | 2026-05-04 |
| 4 | Mem0 - GPT-5.4 | 0.15 | — | Imported | 2026-05-04 |
| 5 | ICL - Claude Opus 4.7 | 0.10 | — | Imported | 2026-05-04 |
| 6 | ICL Notepad - GPT-5.4 | 0.08 | — | Imported | 2026-05-04 |
| 7 | ICL - Gemini 3 Flash | 0.08 | — | Imported | 2026-05-04 |
| 8 | Codex - GPT-5.4 | 0.07 | — | Imported | 2026-05-04 |
| 9 | ACE - GPT-5.4 | 0.05 | — | Imported | 2026-05-04 |
| 10 | ICL Notepad - Claude Sonnet 4.6 | 0.03 | — | Imported | 2026-05-04 |
| 11 | ICL Notepad - Gemini 3.1 Pro Preview | -0.00 | — | Imported | 2026-05-04 |
| 12 | ICL - Gemini 3.1 Pro Preview | -0.06 | — | Imported | 2026-05-04 |
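The Rank column appears to follow from sorting subjects by the primary metric, Agg. Reward, in descending order. The following is a minimal sketch of that ordering using the values from the table. The function name `rank_by_reward` is hypothetical, and the tie-breaking rule (tied subjects keep their listed order) is an assumption, since the page does not state how ties such as the two 0.08 rows are resolved.

```python
# (subject, agg_reward) pairs copied from the table.
# The table shows -0.00 for one row; it is stored here as -0.0 because
# the unrounded value is not given on the page.
ROWS = [
    ("ICL - Claude Sonnet 4.6", 0.22),
    ("ICL - GPT-5.4", 0.20),
    ("Claude Code - Sonnet 4.6", 0.19),
    ("Mem0 - GPT-5.4", 0.15),
    ("ICL - Claude Opus 4.7", 0.10),
    ("ICL Notepad - GPT-5.4", 0.08),
    ("ICL - Gemini 3 Flash", 0.08),
    ("Codex - GPT-5.4", 0.07),
    ("ACE - GPT-5.4", 0.05),
    ("ICL Notepad - Claude Sonnet 4.6", 0.03),
    ("ICL Notepad - Gemini 3.1 Pro Preview", -0.0),
    ("ICL - Gemini 3.1 Pro Preview", -0.06),
]


def rank_by_reward(rows):
    """Return (rank, subject, reward) tuples, highest reward first.

    sorted() is stable, so subjects with equal reward keep their input
    order. That tie-breaking behaviour is an assumption of this sketch.
    """
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    return [
        (rank, subject, reward)
        for rank, (subject, reward) in enumerate(ordered, start=1)
    ]


ranking = rank_by_reward(ROWS)
for rank, subject, reward in ranking:
    print(f"{rank:>2}  {subject:<40} {reward:.2f}")
```

Sorting on a different metric, such as Avg. Cost where lower is better, would need `reverse=False`; those values are not listed in this table.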