
feat(memory): JSONL-backed session persistence#732

Open
is-Xiaoen wants to merge 10 commits into sipeed:main from is-Xiaoen:feat/jsonl-memory-store

Conversation


@is-Xiaoen is-Xiaoen commented Feb 24, 2026

Summary

Implements pkg/memory/ — a new session persistence layer using append-only JSONL files. This is a revised approach based on the feedback from #719, where @yinwm pointed out that JSONL fits this use case better than SQLite.

  • Zero new dependencies — pure stdlib, no go.mod changes
  • Append-only writes — AddMessage is a single file append, no full-file rewrite
  • Logical truncation — TruncateHistory updates a skip offset in .meta.json instead of rewriting the message file
  • Physical compaction — Compact rewrites the JSONL file to reclaim disk space after repeated truncations
  • Agent-readable — .jsonl files work directly with read_file, tail, grep, and agent skills
  • Crash-safe — meta is always written before JSONL rewrites; incomplete lines from interrupted writes are silently skipped; TruncateHistory always reconciles meta.Count against the actual file

File layout

sessions/
├── telegram_123456.jsonl       # one message per line, append-only
├── telegram_123456.meta.json   # summary, skip offset, timestamps

Store interface

The Store interface maps 1:1 to the current SessionManager API. Each method is an atomic operation — no separate Save() call. It returns structured data (messages + summary separately), which keeps the storage layer clean and lets the prompt builder handle provider-specific caching optimizations downstream.

type Store interface {
    AddMessage(ctx context.Context, key, role, content string) error
    AddFullMessage(ctx context.Context, key string, msg providers.Message) error
    GetHistory(ctx context.Context, key string) ([]providers.Message, error)
    GetSummary(ctx context.Context, key string) (string, error)
    SetSummary(ctx context.Context, key, summary string) error
    TruncateHistory(ctx context.Context, key string, keepLast int) error
    SetHistory(ctx context.Context, key string, history []providers.Message) error
    Compact(ctx context.Context, key string) error
    Close() error
}

Crash safety design

The two-file design (.jsonl + .meta.json) introduces crash windows between the two writes. The ordering is carefully chosen to always degrade toward "more data visible" rather than data loss:

| Operation | Crash scenario | Effect | Recovery |
|---|---|---|---|
| addMsg | JSONL written, meta not updated | meta.Count stale by 1 | TruncateHistory always re-counts actual lines |
| SetHistory | Meta written (Skip=0), JSONL not rewritten | Old messages temporarily visible | Next SetHistory corrects |
| Compact | Meta written (Skip=0), JSONL not rewritten | Truncated messages reappear | Next Compact/TruncateHistory corrects |

Concurrency

Session locking uses a fixed [64]sync.Mutex sharded array (FNV hash). Memory is O(1) regardless of total session count — important for a long-running daemon.
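The sharded-lock idea can be sketched as below; this is an illustrative reconstruction from the description (FNV hash mod 64), with names assumed, not the PR's actual code:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

// lockPool is a fixed-size sharded mutex pool: memory stays O(1) no
// matter how many distinct session keys the daemon ever sees.
type lockPool struct {
	shards [64]sync.Mutex
}

// shardFor maps a session key to one of the 64 shards via FNV-1a.
// Unrelated keys may collide on a shard, which only serializes their
// operations; it never affects correctness.
func (p *lockPool) shardFor(key string) *sync.Mutex {
	h := fnv.New32a()
	h.Write([]byte(key))
	return &p.shards[h.Sum32()%64]
}

func main() {
	var p lockPool
	mu := p.shardFor("telegram:123")
	mu.Lock()
	// ... critical section for this session's files ...
	mu.Unlock()
	// The mapping is deterministic: same key, same shard.
	fmt.Println(p.shardFor("telegram:123") == p.shardFor("telegram:123")) // true
}
```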

Migration

MigrateFromJSON() reads legacy sessions/*.json files, writes them via SetHistory (atomic replace), and renames originals to .json.migrated. Idempotent — safe to retry after crash.

Why JSONL over SQLite

Per @yinwm's review on #719:

  • JSONL append is already atomic at the OS level — no transaction machinery needed
  • Agent tools can directly read session files, which matters for an AI agent framework
  • Zero external dependencies aligns with the "pico" philosophy
  • Covers 90%+ of use cases; SQLite can be added later behind the same Store interface if complex queries are actually needed

Type of change

  • New feature (non-breaking change which adds functionality)

Test plan

33 tests total (all passing, 0 lint issues):

  • 20 unit tests: basic roundtrip, ordering, empty sessions, tool calls, tool call IDs, summary CRUD, truncation (keep N / keep 0 / keep more than exist / stale meta.Count), SetHistory replace + skip reset, colon-in-key filename mapping, crash recovery with partial lines, persistence across instances, session isolation
  • 3 compaction tests: compact removes skipped lines from disk, no-op when skip=0, append after compact works correctly
  • 2 concurrency tests: 10-goroutine concurrent writes; simulated #704 race ("tool_call_ids did not have response messages", HTTP 400 — summarizer goroutine vs main loop)
  • 3 benchmarks: AddMessage throughput, GetHistory at 100 and 1000 messages
  • 8 migration tests: basic, tool calls, multiple files, invalid JSON skip, rename to .migrated, idempotent, colon-in-key, retry after crash (no duplicates)
$ go test ./pkg/memory/... -v -count=1
ok  	github.com/sipeed/picoclaw/pkg/memory	6.517s

$ golangci-lint run ./pkg/memory/...
0 issues.

Scope

This PR only adds new files under pkg/memory/ — no existing code is modified. Wiring into AgentLoop will be a separate PR.

Closes #711

Introduce a backend-agnostic Store interface in pkg/memory/ that maps
one-to-one with the current SessionManager API. Each method is atomic
— no separate Save() call needed.

Refs sipeed#711
Add JSONLStore that persists sessions as .jsonl files (one message per
line) plus .meta.json for summary and truncation offset.

Key design decisions:
- Append-only writes — no full-file rewrites on AddMessage
- Logical truncation via skip offset instead of physical deletion
- Per-session mutex for safe concurrent access
- Crash recovery: malformed trailing lines are silently skipped
- Atomic metadata writes using temp+rename

Zero new dependencies — pure stdlib.

Refs sipeed#711
Cover all Store interface methods plus edge cases:
- Basic roundtrip, ordering, empty session, tool calls
- Logical truncation (keep last N, keep zero, keep more than exist)
- SetHistory replacing all + resetting skip offset
- Crash recovery with partial JSON lines
- Persistence across store instances
- Concurrent add+read (10 goroutines x 20 msgs)
- Simulated sipeed#704 race (summarizer vs main loop)
- Benchmarks for AddMessage and GetHistory (100/1000 msgs)
Read existing sessions/*.json files, convert to JSONL format, and
rename originals to .json.migrated as backup. The migration is
idempotent — second runs skip already-migrated files.

Session keys are read from JSON content (not filenames) so that
sanitized names like telegram_123 correctly map back to telegram:123.
Address file growth concern from sipeed#711 review: logical truncation via
skip offset is fast but leaves dead lines on disk indefinitely.

Compact() rewrites the JSONL file keeping only active messages, using
the same temp+rename pattern for crash safety. No-op when skip == 0.
The caller (lifecycle manager or agent loop) decides when to trigger
compaction — e.g. when skipped lines exceed active lines.
dir string

mu sync.Mutex
locks map[string]*sync.Mutex
Collaborator


This is usually fine for small tools, but if picoclaw is handling tens of thousands of sessions, it's advisable to consider using an LRU cache to limit the number of locks, or adding a cleanup mechanism to the Close logic. A simpler approach is to directly use sync.Map's LoadOrStore.

Contributor Author


Good call — switched to sync.Map with LoadOrStore. Cleaner and removes the separate mutex entirely. Pushed in 5d73ee2.

// Allow up to 1 MB per line for messages with large content.
scanner.Buffer(make([]byte, 0, 64*1024), 1024*1024)

for scanner.Scan() {
Collaborator


Currently, GetHistory performs an O(N) scan of the entire JSONL file, which will degrade performance as session files grow. We can achieve O(1) access by using byte offsets.

1. Update the metadata schema — add ActiveOffset to track the byte position of the first valid message:

type sessionMeta struct {
    // ...
    ActiveOffset int64 `json:"active_offset"`
}

2. Update TruncateHistory — calculate the byte offset of the N-th message from the end during truncation.

3. Optimize GetHistory — use f.Seek to jump directly to the active conversation:

// In GetHistory
if meta.ActiveOffset > 0 {
    // Start scanning from here...
    if _, err := f.Seek(meta.ActiveOffset, io.SeekStart); err != nil {
        return nil, err
    }
}

Why: This prevents unnecessary CPU/Memory overhead from parsing discarded JSON lines, especially for long-lived sessions.

Contributor Author


Makes sense. I went with a slightly different approach for now: added a skip parameter to readMessages so GetHistory and Compact both skip the first N lines without unmarshaling them. This avoids the CPU cost on truncated lines while keeping the implementation straightforward.

For the byte-offset Seek approach — I think that's a solid next step if sessions get really long. The trade-off right now is that TruncateHistory stays O(1) (just a metadata write), whereas computing byte offsets during truncation would make it O(N). Combined with Compact bounding the file size, the skip-scan approach should hold up well in practice. Happy to add ActiveOffset later if we see it becoming a bottleneck.

Pushed in 5d73ee2.


// readMessages reads all valid JSON lines from a .jsonl file.
// Malformed trailing lines (e.g. from a crash) are silently skipped.
func readMessages(path string) ([]providers.Message, error) {
Collaborator


Or we could specify an offset as a param here.

Contributor Author


Done — readMessages now takes a skip int parameter. Lines before the offset are scanned but not unmarshaled, saving the JSON parsing overhead. See 5d73ee2.

return nil
}

all, err := readMessages(s.jsonlPath(sessionKey))
Collaborator


We could save the last 20 messages and use seek to skip the front messages.

Contributor Author


Addressed — Compact now passes meta.Skip to readMessages, so the skipped front lines are scanned without unmarshaling. Same commit (5d73ee2).

…dMessages

Address review feedback from @Zhaoyikaiii:

- Replace map[string]*sync.Mutex + separate mu with sync.Map.LoadOrStore
  for simpler, lock-free session lock management.

- Add skip parameter to readMessages so callers (GetHistory, Compact)
  can skip truncated lines without paying the json.Unmarshal cost.

- Add countLines helper for TruncateHistory's count reconciliation,
  avoiding full deserialization when only the line count is needed.
Address feedback from @yinwm for long-running daemon use:

- Replace sync.Map with a fixed-size sharded lock array (64 mutexes).
  Keys are mapped via FNV hash, so memory is O(1) regardless of how
  many sessions are created over the process lifetime.

- Increase scanner buffer cap from 1 MB to 10 MB. Tool results
  (read_file on large files, web search responses) can easily exceed
  1 MB. The scanner still starts at 64 KB and only grows as needed.
Contributor Author

is-Xiaoen commented Feb 26, 2026

sync.Map → sharded lock array: Replaced with a fixed [64]sync.Mutex pool, keys mapped via FNV hash. Memory is O(1) regardless of session count — no growth over the daemon's lifetime.

Scanner buffer 1MB → 10MB: Tool results (read_file on large files, web search dumps, etc.) can exceed 1MB easily. Bumped the cap to 10MB. The scanner still starts at 64KB and grows lazily, so normal messages don't pay for it.

A crash between the JSONL append and the meta update in addMsg can
leave meta.Count stale (e.g. file has 101 lines but meta says 100).
The previous code only reconciled when Count==0, so a nonzero stale
count was silently trusted, causing keepLast/skip to be calculated
against the wrong total.

Now TruncateHistory always counts the actual lines on disk. This is
cheap (scan without unmarshal) and TruncateHistory is not a hot path.
In SetHistory and Compact, the JSONL file was rewritten before updating
the meta file. If the process crashed between the two writes, the meta
still had a large Skip value pointing past the now-shorter JSONL file,
causing GetHistory to return empty — effectively data loss.

Reverse the order: write meta (with Skip=0) first, then rewrite JSONL.
On crash between the two writes, the old uncompacted file is still
intact and GetHistory reads from line 1, returning stale-but-complete
data. The next operation self-corrects.
MigrateFromJSON previously called AddFullMessage in a loop, then
renamed the .json file to .json.migrated. If the process crashed
after appending some messages but before the rename, a retry would
re-read the same .json and append all messages again — duplicating
whatever was written before the crash.

Switch to SetHistory which atomically replaces the session contents.
A retry after crash overwrites the partial data instead of appending.
@is-Xiaoen
Contributor Author

This round focuses on three crash safety fixes:

1. TruncateHistory's reconciliation logic was not robust enough

Previously the file's lines were only counted when meta.Count == 0. But addMsg writes the JSONL and the meta in two separate steps, and a crash in between leaves meta.Count too small (e.g. the file has 101 lines but meta says 100). A nonzero stale count was trusted as-is, so the skip computed from keepLast came out wrong.

Now TruncateHistory reconciles via countLines on every call. It is not a hot path anyway, and countLines only scans lines without unmarshaling.

2. The write order in SetHistory / Compact risked data loss

Previously the JSONL was rewritten first and the meta updated second. If the process crashed in between, the meta still held the old Skip=90 while the JSONL file only had 10 lines left — GetHistory would start reading at line 91 and return empty.

The order is now reversed: write the meta (Skip=0) first, then rewrite the JSONL. The worst case for a crash in between is that the old file is still intact while the meta says Skip=0, so GetHistory returns some already-truncated messages, but no data is lost. The next operation self-corrects.

3. Migration duplicated data on crash retry

Previously migration called AddFullMessage per message and only renamed .json to .json.migrated after everything was written. If the process crashed after writing 50 messages (before the rename), a retry would write all 100 again, ending up with 150.

Switched to atomic replacement via SetHistory, so a retry overwrites the partial data instead of appending.

Each fix is verified by a corresponding test case.


@nikolasdehor nikolasdehor left a comment


Excellent work. The design is well-reasoned, the crash safety analysis is thorough, and the test coverage (33 tests including concurrency and migration edge cases) is impressive. A few observations:

1. Lock shard collision could cause correctness issues with countLines.
FNV hash mod 64 means different session keys can share the same mutex shard. If two unrelated sessions happen to share a shard, their operations serialize correctly (just slower). However, countLines in TruncateHistory counts lines for one session while another session sharing the same shard is blocked -- this is fine because the lock prevents concurrent writes to the same session. The shard collision only causes unnecessary serialization between unrelated sessions. No bug here, just confirming the design works.

2. addMsg has a crash window between JSONL append and meta write.
As documented in the PR, if the process crashes after appending to the JSONL but before updating .meta.json, meta.Count becomes stale. The comment says TruncateHistory always re-counts. But GetHistory uses meta.Skip from the potentially-stale meta -- this is still correct because the skip offset has not changed, and the new message appears at the end. Good. However, note that meta.Count itself is never re-reconciled during AddMessage -- it just increments from the (possibly stale) value. So after a crash + recovery, meta.Count could be permanently off by 1 unless TruncateHistory is called. This is harmless since Count is only used by TruncateHistory (which re-counts), but worth documenting.

3. sanitizeKey is a lossy mapping.
As the test TestMigrateFromJSON_ColonInKey correctly notes, telegram:123 and telegram_123 map to the same file. This means a malicious or misconfigured channel name could collide with another. For a personal assistant framework this is acceptable, but consider adding a comment in sanitizeKey noting this is an intentional tradeoff.

4. rewriteJSONL does not call f.Sync() before rename.
On Linux, os.Rename does not guarantee that the file data is flushed to disk. If the OS crashes (not just the process) between f.Close() and after os.Rename, the file could be zero-length or corrupt. Adding f.Sync() before f.Close() in rewriteJSONL would make it crash-safe against OS crashes too. For a "pico" tool this is probably overkill, but since the PR explicitly discusses crash safety, worth mentioning.

5. No fsync on the JSONL append in addMsg either.
Same concern as above -- f.Write + f.Close() does not guarantee durability against power loss. The message could be lost entirely (not just truncated). Again, probably acceptable for the use case.

6. readMessages silently skips corrupt lines.
This is correct JSONL recovery behavior, but it means a partially-written line (from a crash) is silently dropped. If this happens to be a critical user message, it is lost. Consider logging a warning when a non-empty line fails to unmarshal, so operators know data was lost.

Overall this is one of the cleanest new-package PRs I have seen on this project. The interface design is right, the crash safety reasoning is thorough, and the tests are comprehensive.

Successfully merging this pull request may close these issues:

[Feature] JSONL-backed session persistence with Store interface