LongMemEval-V2

Long-term web-agent memory benchmark evaluating whether memory systems retrieve useful multimodal trajectory evidence for downstream question answering.

Rows: 6
Primary metric: avg_accuracy
Sampled: 2026-05-26

Metadata

Metrics

Average Accuracy, LME-V2-Small Accuracy, LME-V2-Medium Accuracy, LME-V2-Small Query Latency (lower is better), LME-V2-Medium Query Latency (lower is better), LME-V2-Small Static, LME-V2-Medium Static, LME-V2-Small Dynamic, LME-V2-Medium Dynamic, LME-V2-Small Procedure, LME-V2-Medium Procedure, LME-V2-Small Gotchas, LME-V2-Medium Gotchas

Latest Results

Released baseline and AgentRunbook operating points. Rows are ranked by the mean of LME-V2-Small and LME-V2-Medium accuracy (Average Accuracy) because the published LAFS Gain values are null.

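| Rank | Subject | Average Accuracy | Model Match | Provenance | Sampled |
|------|---------|------------------|-------------|------------|---------|
| 1 | AgentRunbook-C | 72.50 | | Imported | 2026-05-26 |
| 2 | Codex | 69.30 | | Imported | 2026-05-26 |
| 3 | AgentRunbook-R | 57.80 | | Imported | 2026-05-26 |
| 4 | RAG: query to slice + notes | 48.50 | | Imported | 2026-05-26 |
| 5 | RAG: query to slice | 40.50 | | Imported | 2026-05-26 |
| 6 | No retrieval | 1.30 | | Imported | 2026-05-26 |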
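
The ranking rule described above (order rows by the mean of Small and Medium accuracy) can be sketched as follows. This is a minimal illustration, not the leaderboard's actual code: the names `ROWS`, `mean_accuracy`, and `rank_rows` are hypothetical, and the per-tier Small/Medium scores are not reproduced here, so the sketch sorts on the Average Accuracy values already shown in the table.

```python
# Rows copied from the Latest Results table: (subject, average accuracy).
# Deliberately listed out of order so the sort has something to do.
ROWS = [
    ("RAG: query to slice", 40.50),
    ("Codex", 69.30),
    ("No retrieval", 1.30),
    ("AgentRunbook-C", 72.50),
    ("RAG: query to slice + notes", 48.50),
    ("AgentRunbook-R", 57.80),
]


def mean_accuracy(small, medium):
    """Mean of the Small and Medium accuracy scores (assumed to be a plain average)."""
    return (small + medium) / 2.0


def rank_rows(rows):
    """Sort rows by average accuracy, highest first, and attach 1-based ranks."""
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    return [(i + 1, subject, acc) for i, (subject, acc) in enumerate(ordered)]


if __name__ == "__main__":
    for rank, subject, acc in rank_rows(ROWS):
        print(f"{rank} {subject} {acc:.2f}")
```

Latency metrics are marked "lower is better", so a ranking on those columns would presumably drop `reverse=True`; that is an inference from the metric labels rather than something the page states.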