
research: evaluate RL-learned memory policies (Memory-R1, Mem-alpha) #2850

@forkwright

Description

Summary

Two papers introduce RL-trained memory management, arguably the biggest paradigm shift in the agent-memory field to date:

Memory-R1 (Aug 2025, arXiv:2508.19828)

  • RL-trained ADD/UPDATE/DELETE/NOOP operations via PPO/GRPO
  • Only 152 training examples needed
  • Memory Manager learns optimal operations from outcome-driven reward

Mem-alpha (Sep 2025, arXiv:2509.25911)

  • Trained on 30K token sequences, generalizes to 400K+ (13x training length)
  • Reward signal from downstream QA accuracy
  • Builds core/episodic/semantic memory with multiple tools
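Mem-alpha's three-way memory split could map onto our side roughly as in this minimal sketch (the class, field names, and example entries are illustrative, not the paper's actual API):

```python
from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    # Three partitions in the style of Mem-alpha (names are illustrative):
    core: list = field(default_factory=list)      # stable facts about the user/task
    episodic: list = field(default_factory=list)  # time-stamped interaction events
    semantic: list = field(default_factory=list)  # distilled general knowledge

store = MemoryStore()
store.episodic.append(("t0", "user asked about decay policy"))
store.semantic.append("decay questions recur weekly")
print(len(store.episodic), len(store.semantic))  # 1 1
```

In the paper, a trained policy decides which partition (if any) each incoming observation lands in, using multiple tools; here the split is just a data layout.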

Why this matters

Aletheia currently uses hand-coded Datalog rules for memory management decisions (what to store, when to decay, what to consolidate). These papers suggest that RL-learned policies can outperform hand-coded rules and generalize to unseen scenarios.

Proposed evaluation

  1. Define Aletheia's memory management as an MDP: state = current knowledge graph, actions = store/update/decay/consolidate, reward = downstream task success
  2. Benchmark current hand-coded rules against this formulation
  3. Evaluate whether a small RL-trained policy (152-30K examples) could improve memory management decisions
  4. If promising, integrate as a learned decay/consolidation policy alongside existing rules
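Step 1 above could be stubbed out as follows; the action set mirrors Aletheia's store/update/decay/consolidate vocabulary, and the reward follows the papers' downstream-QA signal (all names here are hypothetical, not existing Aletheia code):

```python
from enum import Enum, auto

class MemOp(Enum):
    # Action space of the memory-management MDP (hypothetical names):
    STORE = auto()
    UPDATE = auto()
    DECAY = auto()
    CONSOLIDATE = auto()

def qa_reward(correct: int, total: int) -> float:
    # Outcome-driven reward: downstream QA accuracy after the episode,
    # as in Memory-R1 / Mem-alpha. Returns 0.0 for an empty eval set.
    return correct / total if total else 0.0

print(qa_reward(7, 10))  # 0.7
```

With this framing, benchmarking the hand-coded Datalog rules (step 2) just means logging which `MemOp` each rule fires and scoring the resulting episodes with `qa_reward`.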

Risk

Self-reinforcing error: the agent incorrectly learns to avoid a memory path and never corrects itself (identified by SAGE, arXiv:2409.00872). The Prosoche self-audit would need to detect this.
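One standard mitigation (an assumption on my part, not from either paper) is to force occasional revisits of down-weighted memory paths, epsilon-greedy style, so a wrongly penalized path still gets re-evaluated:

```python
import random

def choose_path(scores: dict, epsilon: float = 0.05) -> str:
    # With probability epsilon, pick a random path (exploration) so a path
    # the policy has wrongly learned to avoid is still re-evaluated;
    # otherwise pick the highest-scoring path (exploitation).
    if random.random() < epsilon:
        return random.choice(list(scores))
    return max(scores, key=scores.get)

print(choose_path({"a": 0.9, "b": -0.4}, epsilon=0.0))  # a
```

This does not replace a self-audit; it only guarantees the audit eventually sees fresh evidence about avoided paths.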

Source

Memory-R1 (arXiv:2508.19828), Mem-alpha (arXiv:2509.25911)
