DedeuceBench

LLM-agent active-learning benchmark with budgeted search results, trap-free rate, query usage, and token-count telemetry.

4 rows
Score100 (primary metric)
Sampled 2026-05-06

Metadata

Metrics

Score100, Success@Budget, TrapFreeRate, EffSucc, QueriesUsed (lower is better), BudgetLeft, TokensIn (lower is better), TokensOut (lower is better), TokensTotal (lower is better)

Latest Results

Rows are parsed from the public Hugging Face leaderboard.csv. Score100 is the primary metric defined by the dataset card as 100 x Success@Budget. Duplicate source model/split rows are preserved as distinct submitted rows.
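The relationship between Score100 and Success@Budget can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own scoring code: the function name `score100`, the two-decimal rounding, and the episode count of 64 are assumptions chosen for the example and are not stated by the source.

```python
def score100(successes: int, episodes: int) -> float:
    """Return 100 x Success@Budget, rounded to two decimals.

    Success@Budget is taken here to be the fraction of episodes solved
    within the query budget (an assumption about how the rate is formed).
    """
    if episodes <= 0:
        raise ValueError("episodes must be positive")
    if not 0 <= successes <= episodes:
        raise ValueError("successes must be between 0 and episodes")
    success_at_budget = successes / episodes
    return round(100 * success_at_budget, 2)


# Hypothetical run: 61 of 64 episodes solved within budget.
example_score = score100(61, 64)
print(example_score)
```

Under these assumptions, a Success@Budget of 0.5 maps to a Score100 of 50, which matches the scale used in the results table.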

Rank  Subject            Score100  Model Match  Provenance  Sampled
1     openai:gpt-5-mini  95.31                  Imported    2026-05-06
2     openai:gpt-5-nano  50                     Imported    2026-05-06
3     openai:gpt-5-mini  39.06                  Imported    2026-05-06
4     openai:gpt-5-nano  1.56                   Imported    2026-05-06
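The ranking above, with duplicate model rows kept as separate entries, can be reproduced with a short Python sketch. The column names `model` and `score100` and the helper `rank_rows` are hypothetical; the actual header of leaderboard.csv is not shown in the source and may differ. The sample values are the four rows listed above.

```python
import csv
import io

# Inline sample standing in for leaderboard.csv (column names are assumed).
SAMPLE_CSV = """model,score100
openai:gpt-5-nano,1.56
openai:gpt-5-mini,95.31
openai:gpt-5-mini,39.06
openai:gpt-5-nano,50
"""


def rank_rows(csv_text: str) -> list[tuple[int, str, float]]:
    """Sort rows by Score100 (higher is better) without merging duplicates."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda row: float(row["score100"]), reverse=True)
    return [
        (rank, row["model"], float(row["score100"]))
        for rank, row in enumerate(rows, start=1)
    ]


ranked = rank_rows(SAMPLE_CSV)
for rank, model, score in ranked:
    print(rank, model, score)
```

Rows are never grouped by model name, so each submitted row keeps its own rank even when the same model appears more than once.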