LongMemEval-V2
Long-term web-agent memory benchmark evaluating whether memory systems retrieve useful multimodal trajectory evidence for downstream question answering.
6 rows
Primary metric: avg_accuracy
Sampled: 2026-05-26
Metadata
Metrics
Average Accuracy, LME-V2-Small Accuracy, LME-V2-Medium Accuracy, LME-V2-Small Query Latency (lower is better), LME-V2-Medium Query Latency (lower is better), LME-V2-Small Static, LME-V2-Medium Static, LME-V2-Small Dynamic, LME-V2-Medium Dynamic, LME-V2-Small Procedure, LME-V2-Medium Procedure, LME-V2-Small Gotchas, LME-V2-Medium Gotchas
| Rank | Subject | Average Accuracy | Model Match | Provenance | Sampled |
|---|---|---|---|---|---|
| 1 | AgentRunbook-C | 72.50 | — | Imported | 2026-05-26 |
| 2 | Codex | 69.30 | — | Imported | 2026-05-26 |
| 3 | AgentRunbook-R | 57.80 | — | Imported | 2026-05-26 |
| 4 | RAG: query to slice + notes | 48.50 | — | Imported | 2026-05-26 |
| 5 | RAG: query to slice | 40.50 | — | Imported | 2026-05-26 |
| 6 | No retrieval | 1.30 | — | Imported | 2026-05-26 |