Skip to content

chore: benchmark aletheia memory against LongMemEval and LoCoMo #2854

@forkwright

Description

@forkwright

Summary

The agent memory field has converged on two standard benchmarks:

  • LongMemEval: 500 manual questions, 5 memory abilities, ~115K token histories. SOTA: Hindsight 91.4%
  • LoCoMo: 10 long-term conversations, ~200 questions each, ~27 sessions, 588 turns. SOTA: Hindsight 89.61%

Aletheia's memory architecture is architecturally ahead of the field, but without benchmark numbers, this is a design claim rather than a demonstrated result.

Also relevant

  • HaluMem (arxiv 2511.03506): First benchmark for memory hallucination
  • ActMemEval: Logic-driven causal reasoning over memory

Aletheia's eval crate (dokimion) could adapt these into scenarios.

Source

LongMemEval, LoCoMo, HaluMem (arxiv 2511.03506).

Metadata

Metadata

Assignees

No one assigned

    Labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions