Summary
Two papers introduce RL-trained memory management, arguably the biggest paradigm shift in the agent memory field to date:
Memory-R1 (Aug 2025, arxiv 2508.19828)
- RL-trained ADD/UPDATE/DELETE/NOOP operations via PPO/GRPO
- Only 152 training examples needed
- Memory Manager learns optimal operations from outcome-driven reward
Mem-alpha (Sep 2025, arxiv 2509.25911)
- Trained on 30K token sequences, generalizes to 400K+ (13x training length)
- Reward signal from downstream QA accuracy
- Builds core/episodic/semantic memory with multiple tools
Why this matters
Aletheia currently uses hand-coded Datalog rules for memory management decisions (what to store, when to decay, what to consolidate). These papers suggest that RL-learned policies can outperform hand-coded rules and generalize to unseen scenarios.
Proposed evaluation
- Define Aletheia's memory management as an MDP: state = current knowledge graph, actions = store/update/decay/consolidate, reward = downstream task success
- Benchmark current hand-coded rules against this formulation
- Evaluate whether a small RL-trained policy (152-30K examples) could improve memory management decisions
- If promising, integrate as a learned decay/consolidation policy alongside existing rules
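The MDP framing above can be sketched in a few lines. This is a minimal illustrative sketch only: the names (MemoryState, MemoryAction, hand_coded_policy, step, reward) are placeholders, not APIs from Memory-R1, Mem-alpha, or Aletheia, and the knowledge graph is reduced to a flat fact dict for clarity.

```python
# Hypothetical sketch of the proposed MDP formulation. All names are
# illustrative stand-ins, not taken from the cited papers or Aletheia.
from dataclasses import dataclass, field
from enum import Enum, auto

class MemoryAction(Enum):
    STORE = auto()
    UPDATE = auto()
    DECAY = auto()
    CONSOLIDATE = auto()
    NOOP = auto()

@dataclass
class MemoryState:
    # Simplified stand-in for the knowledge graph: fact -> age in steps.
    facts: dict[str, int] = field(default_factory=dict)

def hand_coded_policy(state: MemoryState, fact: str) -> MemoryAction:
    # Stand-in for a Datalog rule: store new facts, update known ones.
    return MemoryAction.UPDATE if fact in state.facts else MemoryAction.STORE

def step(state: MemoryState, fact: str, action: MemoryAction) -> MemoryState:
    # Transition function: apply one memory operation, return next state.
    facts = dict(state.facts)
    if action in (MemoryAction.STORE, MemoryAction.UPDATE):
        facts[fact] = 0  # (re)write fact with age 0
    elif action is MemoryAction.DECAY:
        facts.pop(fact, None)
    return MemoryState(facts)

def reward(state: MemoryState, qa_needed: set[str]) -> float:
    # Outcome-driven reward (as in both papers): fraction of facts the
    # downstream QA task needs that are still retrievable from memory.
    hits = sum(1 for f in qa_needed if f in state.facts)
    return hits / len(qa_needed) if qa_needed else 0.0
```

With this scaffolding, benchmarking the hand-coded rules means rolling out episodes under hand_coded_policy and measuring mean reward; an RL-trained policy would be any function with the same (state, fact) -> action signature optimized against that reward.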
Risk
Self-reinforcing error: the agent incorrectly learns to avoid a memory path and never corrects itself, because the avoided path no longer generates the feedback that would revise the policy (identified by SAGE, arxiv 2409.00872). The Prosoche self-audit would need to detect this.
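One cheap audit signal for this failure mode: flag memory paths the policy used earlier in a run but has stopped visiting entirely, so they can be re-tested rather than silently abandoned. A minimal sketch, assuming only a flat visit log; find_suspect_avoidance is a hypothetical helper, not an existing Prosoche function.

```python
# Hypothetical self-audit check (not an actual Prosoche API): detect
# memory paths that were visited earlier but are absent from the most
# recent window of steps, i.e. candidates for self-reinforcing avoidance.
def find_suspect_avoidance(history: list[str], window: int = 50) -> set[str]:
    """Return paths visited before the last `window` steps but not within it."""
    earlier, recent = history[:-window], history[-window:]
    return set(earlier) - set(recent)
```

A flagged path is not necessarily an error, only a prompt for forced re-exploration or human review to check whether the avoidance is justified.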
Source
Memory-R1 (arxiv 2508.19828), Mem-alpha (arxiv 2509.25911)