Summary
The agent memory field has converged on two standard benchmarks:
- LongMemEval: 500 manually curated questions covering 5 memory abilities, ~115K-token histories. SOTA: Hindsight 91.4%
- LoCoMo: 10 long-term conversations, ~200 questions each, ~27 sessions, 588 turns. SOTA: Hindsight 89.61%
Aletheia's memory architecture appears to be ahead of the field by design, but without benchmark numbers this remains a design claim rather than a demonstrated result.
Also relevant
- HaluMem (arxiv 2511.03506): First benchmark for memory hallucination
- ActMemEval: Logic-driven causal reasoning over memory
Aletheia's eval crate (dokimion) could adapt these into scenarios.
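As a rough illustration of what such an adaptation might look like, the sketch below wraps one benchmark question as a scenario and scores an agent over a set of them. Everything here is an assumption: `MemoryScenario`, `is_correct`, and `accuracy` are hypothetical names, not dokimion's actual API, and the substring match is a crude stand-in for the benchmarks' own scoring.

```rust
/// Hypothetical scenario type; dokimion's real types may look quite different.
#[derive(Debug, Clone)]
struct MemoryScenario {
    /// Which benchmark the item came from, e.g. "LongMemEval" or "LoCoMo".
    benchmark: &'static str,
    /// Prior conversation sessions the agent sees before the question.
    history: Vec<String>,
    /// The question asked after the history has been ingested.
    question: String,
    /// Reference answer taken from the benchmark item.
    expected: String,
}

/// Case-insensitive substring check. This is an assumed placeholder scorer,
/// not the scoring procedure either benchmark actually uses.
fn is_correct(scenario: &MemoryScenario, answer: &str) -> bool {
    answer
        .to_lowercase()
        .contains(&scenario.expected.to_lowercase())
}

/// Fraction of scenarios for which `agent` returns a correct answer.
/// Returns 0.0 for an empty scenario list.
fn accuracy<F>(scenarios: &[MemoryScenario], mut agent: F) -> f64
where
    F: FnMut(&MemoryScenario) -> String,
{
    if scenarios.is_empty() {
        return 0.0;
    }
    let mut correct = 0usize;
    for scenario in scenarios {
        let answer = agent(scenario);
        if is_correct(scenario, &answer) {
            correct += 1;
        }
    }
    correct as f64 / scenarios.len() as f64
}
```

A loader for each benchmark would map its items into `MemoryScenario` values, and the closure passed to `accuracy` would call into the memory system under test. Reporting accuracy per benchmark would make the result directly comparable with the published SOTA figures.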
Source
LongMemEval, LoCoMo, HaluMem (arxiv 2511.03506).