Continual Learning Bench

A benchmark of expert-validated tasks for agents that learn and improve across sequences of task instances rather than solving independent tasks from scratch.

Rows: 12
Primary metric: agg_reward
Sampled: 2026-05-04

Metadata

Metrics

Agg. Reward, Agg. Gain, Avg. Cost (for cost, lower is better)

Latest Results

Rows are agent system configurations, not one score per base model. Aggregate reward is the primary metric from the public leaderboard data.

Rank  Subject                               Agg. Reward  Model Match  Provenance  Sampled
1     ICL - Claude Sonnet 4.6               0.22                      Imported    2026-05-04
2     ICL - GPT-5.4                         0.20                      Imported    2026-05-04
3     Claude Code - Sonnet 4.6              0.19                      Imported    2026-05-04
4     Mem0 - GPT-5.4                        0.15                      Imported    2026-05-04
5     ICL - Claude Opus 4.7                 0.10                      Imported    2026-05-04
6     ICL Notepad - GPT-5.4                 0.08                      Imported    2026-05-04
7     ICL - Gemini 3 Flash                  0.08                      Imported    2026-05-04
8     Codex - GPT-5.4                       0.07                      Imported    2026-05-04
9     ACE - GPT-5.4                         0.05                      Imported    2026-05-04
10    ICL Notepad - Claude Sonnet 4.6       0.03                      Imported    2026-05-04
11    ICL Notepad - Gemini 3.1 Pro Preview  -0.00                     Imported    2026-05-04
12    ICL - Gemini 3.1 Pro Preview          -0.06                     Imported    2026-05-04
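
Because each row is an agent system configuration rather than a base model, the same model appears several times in the table above. The following is a minimal Python sketch of how the rows could be grouped by model label. It assumes each Subject has the form "agent system - model" with " - " as the separator, which is inferred from the labels and not documented by the leaderboard; the function names are hypothetical.

```python
from collections import defaultdict

# (subject, aggregate reward) pairs copied from the results table above.
ROWS = [
    ("ICL - Claude Sonnet 4.6", 0.22),
    ("ICL - GPT-5.4", 0.20),
    ("Claude Code - Sonnet 4.6", 0.19),
    ("Mem0 - GPT-5.4", 0.15),
    ("ICL - Claude Opus 4.7", 0.10),
    ("ICL Notepad - GPT-5.4", 0.08),
    ("ICL - Gemini 3 Flash", 0.08),
    ("Codex - GPT-5.4", 0.07),
    ("ACE - GPT-5.4", 0.05),
    ("ICL Notepad - Claude Sonnet 4.6", 0.03),
    ("ICL Notepad - Gemini 3.1 Pro Preview", -0.00),
    ("ICL - Gemini 3.1 Pro Preview", -0.06),
]


def split_subject(subject):
    """Split a Subject label into (agent_system, model_label).

    Assumption: the first ' - ' separates the agent system from the model.
    """
    agent, model = subject.split(" - ", 1)
    return agent, model


def rows_by_model(rows):
    """Group (agent_system, reward) entries under each model label, verbatim."""
    grouped = defaultdict(list)
    for subject, reward in rows:
        agent, model = split_subject(subject)
        grouped[model].append((agent, reward))
    return dict(grouped)


def best_per_model(rows):
    """Return the highest-reward (agent_system, reward) entry per model label."""
    return {
        model: max(entries, key=lambda entry: entry[1])
        for model, entries in rows_by_model(rows).items()
    }


if __name__ == "__main__":
    for model, (agent, reward) in best_per_model(ROWS).items():
        print(f"{model}: best agent system is {agent} ({reward:+.2f})")
```

Model labels are used exactly as they appear, so "Sonnet 4.6" (from the Claude Code row) stays separate from "Claude Sonnet 4.6". Merging the two would require an explicit alias map, which is a further assumption this sketch does not make.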