TemporalBench

TemporalBench evaluates LLM-based agents on contextual and event-informed time-series tasks spanning multiple datasets and task types.

Rows: 5
Primary metric: overall_mcq_acc
Sampled: 2026-05-06

Metadata

Metrics

Overall MCQ accuracy, T1 accuracy, T2 accuracy, T3 accuracy, T4 accuracy
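The page does not say how these accuracies are computed. As a minimal sketch, assuming each question carries a task label (T1 to T4) and that the overall score is the fraction of all questions answered correctly, the metrics could be derived as below. The function name `mcq_accuracy`, the record layout, and the toy data are hypothetical and are not taken from TemporalBench.

```python
from collections import defaultdict


def mcq_accuracy(records):
    """Return (overall_accuracy, per_task_accuracy).

    `records` is an iterable of (task, predicted, correct) tuples.
    This layout is an assumption made for illustration only.
    """
    totals = defaultdict(int)  # questions seen per task
    hits = defaultdict(int)    # correct answers per task
    for task, predicted, correct in records:
        totals[task] += 1
        hits[task] += int(predicted == correct)

    per_task = {task: hits[task] / totals[task] for task in totals}
    n = sum(totals.values())
    # Assumed: overall accuracy pools all questions (micro-average).
    overall = sum(hits.values()) / n if n else 0.0
    return overall, per_task


# Made-up toy answers, not benchmark data.
records = [
    ("T1", "A", "A"),
    ("T1", "B", "C"),
    ("T2", "D", "D"),
    ("T3", "A", "B"),
    ("T4", "C", "C"),
]
overall, per_task = mcq_accuracy(records)
print(overall, per_task)
```

If the benchmark instead averages the four task accuracies with equal weight, the overall figure would differ whenever the tasks have different question counts.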

Latest Results

Rank  Subject                        Overall MCQ accuracy  Model Match  Provenance  Sampled
1     Single LLM (gpt-4o)            0.34                               Imported    2026-05-06
2     AgentScope (gpt-4o)            0.33                               Imported    2026-05-06
3     CAMEL (gpt-4o)                 0.32                               Imported    2026-05-06
4     MetaGPT (gpt-4o)               0.32                               Imported    2026-05-06
5     TimeSeries Scientist (gpt-4o)  0.24                               Imported    2026-05-06