TemporalBench
TemporalBench evaluates LLM-based agents on contextual and event-informed time-series tasks spanning multiple datasets and task types.
5 rows
Primary metric: overall_mcq_acc
Sampled: 2026-05-06
Metadata
Metrics
Overall MCQ accuracy, T1 accuracy, T2 accuracy, T3 accuracy, T4 accuracy
| Rank | Subject | Overall MCQ accuracy | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | Single LLM (gpt-4o) | 0.34 | — | Imported | 2026-05-06 |
| 2 | AgentScope (gpt-4o) | 0.33 | — | Imported | 2026-05-06 |
| 3 | CAMEL (gpt-4o) | 0.32 | — | Imported | 2026-05-06 |
| 4 | MetaGPT (gpt-4o) | 0.32 | — | Imported | 2026-05-06 |
| 5 | TimeSeries Scientist (gpt-4o) | 0.24 | — | Imported | 2026-05-06 |
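
The ordering in the table can be reproduced with a short sketch. It assumes, without confirmation from the page, that rank is assigned by sorting on overall MCQ accuracy in descending order and that tied rows (CAMEL and MetaGPT at 0.32) keep their listed order. The `rank_rows` helper is hypothetical and is not part of TemporalBench.

```python
# Rows copied from the leaderboard table: (subject, overall MCQ accuracy).
rows = [
    ("Single LLM (gpt-4o)", 0.34),
    ("AgentScope (gpt-4o)", 0.33),
    ("CAMEL (gpt-4o)", 0.32),
    ("MetaGPT (gpt-4o)", 0.32),
    ("TimeSeries Scientist (gpt-4o)", 0.24),
]


def rank_rows(rows):
    """Hypothetical helper: sort by accuracy, highest first.

    sorted() is stable, so tied rows keep their input order
    (an assumption about how the leaderboard breaks ties).
    """
    ordered = sorted(rows, key=lambda r: r[1], reverse=True)
    return [(i, subject, acc) for i, (subject, acc) in enumerate(ordered, start=1)]


ranked = rank_rows(rows)
for rank, subject, acc in ranked:
    print(f"{rank}. {subject}: {acc:.2f}")
```

Because two entries share the same score, positions 3 and 4 are better read as a tie than as a meaningful difference.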