feat(memory): JSONL-backed session persistence #732
is-Xiaoen wants to merge 10 commits into sipeed:main
Conversation
Introduce a backend-agnostic Store interface in pkg/memory/ that maps one-to-one with the current SessionManager API. Each method is atomic — no separate Save() call needed. Refs sipeed#711
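The Store contract described above might look roughly like the sketch below. The method names AddMessage, GetHistory, SetHistory, and TruncateHistory come from this PR; the Message fields, the memStore toy implementation, and the exact signatures are illustrative assumptions, not the actual pkg/memory API.

```go
package main

import "fmt"

// Message is a minimal stand-in for the providers.Message type used in the PR.
type Message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

// Store maps one-to-one onto the SessionManager API; every method is atomic.
type Store interface {
	AddMessage(sessionKey string, msg Message) error
	GetHistory(sessionKey string) ([]Message, error)
	SetHistory(sessionKey string, msgs []Message) error
	TruncateHistory(sessionKey string, keepLast int) error
}

// memStore is a toy in-memory implementation, only to illustrate the contract.
type memStore struct{ data map[string][]Message }

func (s *memStore) AddMessage(k string, m Message) error {
	s.data[k] = append(s.data[k], m)
	return nil
}
func (s *memStore) GetHistory(k string) ([]Message, error)  { return s.data[k], nil }
func (s *memStore) SetHistory(k string, ms []Message) error { s.data[k] = ms; return nil }
func (s *memStore) TruncateHistory(k string, keepLast int) error {
	if msgs := s.data[k]; len(msgs) > keepLast {
		s.data[k] = msgs[len(msgs)-keepLast:]
	}
	return nil
}

func main() {
	var st Store = &memStore{data: map[string][]Message{}}
	st.AddMessage("telegram:123", Message{Role: "user", Content: "hi"})
	st.AddMessage("telegram:123", Message{Role: "assistant", Content: "hello"})
	st.TruncateHistory("telegram:123", 1)
	h, _ := st.GetHistory("telegram:123")
	fmt.Println(len(h), h[0].Content) // 1 hello
}
```

Because every method is atomic, a caller never needs a separate Save() step; swapping the JSONL backend for another store is a constructor change only.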
Add JSONLStore that persists sessions as .jsonl files (one message per line) plus .meta.json for summary and truncation offset. Key design decisions:
- Append-only writes — no full-file rewrites on AddMessage
- Logical truncation via skip offset instead of physical deletion
- Per-session mutex for safe concurrent access
- Crash recovery: malformed trailing lines are silently skipped
- Atomic metadata writes using temp+rename

Zero new dependencies — pure stdlib. Refs sipeed#711
Cover all Store interface methods plus edge cases:
- Basic roundtrip, ordering, empty session, tool calls
- Logical truncation (keep last N, keep zero, keep more than exist)
- SetHistory replacing all + resetting skip offset
- Crash recovery with partial JSON lines
- Persistence across store instances
- Concurrent add+read (10 goroutines x 20 msgs)
- Simulated sipeed#704 race (summarizer vs main loop)
- Benchmarks for AddMessage and GetHistory (100/1000 msgs)
Read existing sessions/*.json files, convert to JSONL format, and rename originals to .json.migrated as backup. The migration is idempotent — second runs skip already-migrated files. Session keys are read from JSON content (not filenames) so that sanitized names like telegram_123 correctly map back to telegram:123.
Address file growth concern from sipeed#711 review: logical truncation via skip offset is fast but leaves dead lines on disk indefinitely. Compact() rewrites the JSONL file keeping only active messages, using the same temp+rename pattern for crash safety. No-op when skip == 0. The caller (lifecycle manager or agent loop) decides when to trigger compaction — e.g. when skipped lines exceed active lines.
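The caller-side trigger suggested above ("when skipped lines exceed active lines") is a one-liner. shouldCompact is a hypothetical helper, not part of the PR:

```go
package main

import "fmt"

// shouldCompact reports whether dead (skipped) lines outnumber active ones,
// the compaction heuristic suggested in the PR description.
func shouldCompact(skip, count int) bool {
	active := count - skip
	return skip > 0 && skip > active
}

func main() {
	fmt.Println(shouldCompact(0, 100))  // no skipped lines -> Compact is a no-op
	fmt.Println(shouldCompact(30, 100)) // 30 dead vs 70 active -> not yet
	fmt.Println(shouldCompact(60, 100)) // 60 dead vs 40 active -> compact
}
```

Keeping the policy in the caller means the store stays mechanism-only; a lifecycle manager could equally compact on a timer or at shutdown.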
Force-pushed from cfc5ba7 to b464687
pkg/memory/jsonl.go
Outdated
```go
dir string

mu    sync.Mutex
locks map[string]*sync.Mutex
```
This is usually fine for small tools, but if picoclaw is handling tens of thousands of sessions, it's advisable to consider using an LRU cache to limit the number of locks, or adding a cleanup mechanism to the Close logic. A simpler approach is to directly use sync.Map's LoadOrStore.
Good call — switched to sync.Map with LoadOrStore. Cleaner and removes the separate mutex entirely. Pushed in 5d73ee2.
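The LoadOrStore pattern adopted at this point in the review (later replaced by a sharded array, per a subsequent round of feedback) might look like this; lockFor is an illustrative name:

```go
package main

import (
	"fmt"
	"sync"
)

var locks sync.Map // sessionKey -> *sync.Mutex

// lockFor returns the mutex for a session, creating it on first use.
// LoadOrStore makes get-or-create race-free without a separate guard mutex.
func lockFor(key string) *sync.Mutex {
	mu, _ := locks.LoadOrStore(key, &sync.Mutex{})
	return mu.(*sync.Mutex)
}

func main() {
	a := lockFor("telegram:123")
	b := lockFor("telegram:123")
	fmt.Println(a == b) // same session key -> same mutex
}
```

Note the allocation of a fresh mutex on every call even when the key exists; that throwaway allocation is the usual cost of the LoadOrStore idiom.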
```go
// Allow up to 1 MB per line for messages with large content.
scanner.Buffer(make([]byte, 0, 64*1024), 1024*1024)

for scanner.Scan() {
```
Currently, GetHistory performs a full scan, paying the JSON parsing cost even for discarded lines. Suggested optimization:

1. Add an `ActiveOffset` field to the session metadata:

```go
type sessionMeta struct {
	// ...
	ActiveOffset int64 `json:"active_offset"`
}
```

2. Update TruncateHistory: calculate the byte offset of the N-th message from the end during truncation.

3. Optimize GetHistory: use f.Seek to jump directly to the active conversation:

```go
// In GetHistory
if meta.ActiveOffset > 0 {
	// Start scanning from here...
	if _, err := f.Seek(meta.ActiveOffset, io.SeekStart); err != nil {
		return nil, err
	}
}
```

Why: this prevents unnecessary CPU/memory overhead from parsing discarded JSON lines, especially for long-lived sessions.
Makes sense. I went with a slightly different approach for now: added a skip parameter to readMessages so GetHistory and Compact both skip the first N lines without unmarshaling them. This avoids the CPU cost on truncated lines while keeping the implementation straightforward.
For the byte-offset Seek approach — I think that's a solid next step if sessions get really long. The trade-off right now is that TruncateHistory stays O(1) (just a metadata write), whereas computing byte offsets during truncation would make it O(N). Combined with Compact bounding the file size, the skip-scan approach should hold up well in practice. Happy to add ActiveOffset later if we see it becoming a bottleneck.
Pushed in 5d73ee2.
pkg/memory/jsonl.go
Outdated
```go
// readMessages reads all valid JSON lines from a .jsonl file.
// Malformed trailing lines (e.g. from a crash) are silently skipped.
func readMessages(path string) ([]providers.Message, error) {
```
Or we could specify an offset as a param here.
Done — readMessages now takes a skip int parameter. Lines before the offset are scanned but not unmarshaled, saving the JSON parsing overhead. See 5d73ee2.
pkg/memory/jsonl.go
Outdated
```go
	return nil
}

all, err := readMessages(s.jsonlPath(sessionKey))
```
We could save the last 20 messages and use seek to skip the front messages.
Addressed — Compact now passes meta.Skip to readMessages, so the skipped front lines are scanned without unmarshaling. Same commit (5d73ee2).
…dMessages

Address review feedback from @Zhaoyikaiii:
- Replace map[string]*sync.Mutex + separate mu with sync.Map.LoadOrStore for simpler, lock-free session lock management.
- Add skip parameter to readMessages so callers (GetHistory, Compact) can skip truncated lines without paying the json.Unmarshal cost.
- Add countLines helper for TruncateHistory's count reconciliation, avoiding full deserialization when only the line count is needed.
Address feedback from @yinwm for long-running daemon use:
- Replace sync.Map with a fixed-size sharded lock array (64 mutexes). Keys are mapped via FNV hash, so memory is O(1) regardless of how many sessions are created over the process lifetime.
- Increase scanner buffer cap from 1 MB to 10 MB. Tool results (read_file on large files, web search responses) can easily exceed 1 MB. The scanner still starts at 64 KB and only grows as needed.
sync.Map → sharded lock array: replaced with a fixed 64-mutex array keyed by FNV hash, so lock memory stays O(1) over the process lifetime.

Scanner buffer 1 MB → 10 MB: tool results (read_file on large files, web search responses) can exceed 1 MB; the buffer still starts at 64 KB and grows only as needed.
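The sharded-lock scheme described above can be sketched as follows; shardFor is an illustrative name:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

// shards is a fixed-size lock array: O(1) memory no matter how many
// sessions are created over the process lifetime.
var shards [64]sync.Mutex

// shardFor maps a session key to one of the 64 mutexes via FNV-1a.
func shardFor(key string) *sync.Mutex {
	h := fnv.New32a()
	h.Write([]byte(key))
	return &shards[h.Sum32()%64]
}

func main() {
	// The same key always maps to the same shard. Unrelated keys may
	// collide on a shard, which costs extra serialization but is never
	// a correctness bug.
	fmt.Println(shardFor("telegram:123") == shardFor("telegram:123"))
}
```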
A crash between the JSONL append and the meta update in addMsg can leave meta.Count stale (e.g. file has 101 lines but meta says 100). The previous code only reconciled when Count==0, so a nonzero stale count was silently trusted, causing keepLast/skip to be calculated against the wrong total. Now TruncateHistory always counts the actual lines on disk. This is cheap (scan without unmarshal) and TruncateHistory is not a hot path.
In SetHistory and Compact, the JSONL file was rewritten before updating the meta file. If the process crashed between the two writes, the meta still had a large Skip value pointing past the now-shorter JSONL file, causing GetHistory to return empty — effectively data loss. Reverse the order: write meta (with Skip=0) first, then rewrite JSONL. On crash between the two writes, the old uncompacted file is still intact and GetHistory reads from line 1, returning stale-but-complete data. The next operation self-corrects.
MigrateFromJSON previously called AddFullMessage in a loop, then renamed the .json file to .json.migrated. If the process crashed after appending some messages but before the rename, a retry would re-read the same .json and append all messages again — duplicating whatever was written before the crash. Switch to SetHistory which atomically replaces the session contents. A retry after crash overwrites the partial data instead of appending.
This round focuses on three crash-safety fixes:

1. TruncateHistory's reconciliation was not robust enough. Previously it only re-counted when Count was 0; now TruncateHistory reconciles via countLines on every call. It is not a hot path anyway, and countLines only scans lines without unmarshaling.

2. The write order in SetHistory / Compact risked data loss. Previously the JSONL was rewritten first and meta updated second; a crash in between left meta holding the old large Skip value pointing past the now-shorter file. The order is reversed: write meta (Skip=0) first, then rewrite the JSONL. The worst case for a mid-crash is that the old file is still there and meta says Skip=0, so GetHistory returns some already-truncated messages, but no data is lost. The next operation self-corrects.

3. Migration duplicated writes on crash-and-retry. Previously migration called AddFullMessage per message and only renamed the .json after everything was written. Switched to SetHistory's atomic replace, so a retry overwrites instead of appending.

Each fix is verified by a corresponding test case.
nikolasdehor left a comment
Excellent work. The design is well-reasoned, the crash safety analysis is thorough, and the test coverage (33 tests including concurrency and migration edge cases) is impressive. A few observations:
1. Lock shard collision could cause correctness issues with countLines.
FNV hash mod 64 means different session keys can share the same mutex shard. If two unrelated sessions happen to share a shard, their operations serialize correctly (just slower). However, countLines in TruncateHistory counts lines for one session while another session sharing the same shard is blocked -- this is fine because the lock prevents concurrent writes to the same session. The shard collision only causes unnecessary serialization between unrelated sessions. No bug here, just confirming the design works.
2. addMsg has a crash window between JSONL append and meta write.
As documented in the PR, if the process crashes after appending to the JSONL but before updating .meta.json, meta.Count becomes stale. The comment says TruncateHistory always re-counts. But GetHistory uses meta.Skip from the potentially-stale meta -- this is still correct because the skip offset has not changed, and the new message appears at the end. Good. However, note that meta.Count itself is never re-reconciled during AddMessage -- it just increments from the (possibly stale) value. So after a crash + recovery, meta.Count could be permanently off by 1 unless TruncateHistory is called. This is harmless since Count is only used by TruncateHistory (which re-counts), but worth documenting.
3. sanitizeKey is a lossy mapping.
As the test TestMigrateFromJSON_ColonInKey correctly notes, telegram:123 and telegram_123 map to the same file. This means a malicious or misconfigured channel name could collide with another. For a personal assistant framework this is acceptable, but consider adding a comment in sanitizeKey noting this is an intentional tradeoff.
4. rewriteJSONL does not call f.Sync() before rename.
On Linux, os.Rename does not guarantee that the file data is flushed to disk. If the OS crashes (not just the process) around the f.Close()/os.Rename sequence, the renamed file could be zero-length or corrupt. Adding f.Sync() before f.Close() in rewriteJSONL would make it crash-safe against OS crashes too. For a "pico" tool this is probably overkill, but since the PR explicitly discusses crash safety, worth mentioning.
5. No fsync on the JSONL append in addMsg either.
Same concern as above -- f.Write + f.Close() does not guarantee durability against power loss. The message could be lost entirely (not just truncated). Again, probably acceptable for the use case.
6. readMessages silently skips corrupt lines.
This is correct JSONL recovery behavior, but it means a partially-written line (from a crash) is silently dropped. If this happens to be a critical user message, it is lost. Consider logging a warning when a non-empty line fails to unmarshal, so operators know data was lost.
Overall this is one of the cleanest new-package PRs I have seen on this project. The interface design is right, the crash safety reasoning is thorough, and the tests are comprehensive.
Summary
Implements pkg/memory/ — a new session persistence layer using append-only JSONL files. This is a revised approach based on the feedback from #719, where @yinwm pointed out that JSONL fits this use case better than SQLite.

- AddMessage is a single file append, no full-file rewrite
- TruncateHistory updates a skip offset in .meta.json instead of rewriting the message file
- Compact rewrites the JSONL file to reclaim disk space after repeated truncations
- .jsonl files work directly with read_file, tail, grep, and agent skills

File layout
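The on-disk layout is presumably along these lines (file names inferred from the PR description and the migration section; the exact directory may differ):

```
sessions/
├── telegram_123.jsonl          # one JSON message per line, append-only
├── telegram_123.meta.json      # summary + skip offset + count
└── telegram_123.json.migrated  # legacy file kept as backup after migration
```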
Store interface
The Store interface maps 1:1 to the current SessionManager API. Each method is an atomic operation — no separate Save() call. It returns structured data (messages + summary separately), which keeps the storage layer clean and lets the prompt builder handle provider-specific caching optimizations downstream.

Crash safety design
The two-file design (.jsonl + .meta.json) introduces crash windows between the two writes. The ordering is carefully chosen to always degrade toward "more data visible" rather than data loss:

| Operation | Worst case on crash | Self-correction |
| --- | --- | --- |
| addMsg | meta.Count stale by 1 | TruncateHistory always re-counts actual lines |
| SetHistory | old file readable, stale but complete | next SetHistory corrects |
| Compact | old uncompacted file still intact | next Compact/TruncateHistory corrects |

Concurrency
Session locking uses a fixed [64]sync.Mutex sharded array (FNV hash). Memory is O(1) regardless of total session count — important for a long-running daemon.

Migration
MigrateFromJSON() reads legacy sessions/*.json files, writes them via SetHistory (atomic replace), and renames originals to .json.migrated. Idempotent — safe to retry after crash.

Why JSONL over SQLite

Per @yinwm's review on #719:

- SQLite can still be added later behind the same Store interface if complex queries are actually needed

Type of change
Test plan
33 tests total (all passing, 0 lint issues):
Scope
This PR only adds new files under pkg/memory/ — no existing code is modified. Wiring into AgentLoop will be a separate PR.

Closes #711