Summary
Implement periodic and on-demand snapshotting for Raft state in the MetadataService to bound log growth and speed up recovery.
Why it matters
Prevents unbounded log growth, reduces startup time, and matches the log-compaction approach used by production Raft implementations.
Scope
- Design a snapshot format for metadata state (file-to-chunk mappings, plus the last included Raft term/index).
- Implement InstallSnapshot/SaveSnapshot integration.
- Define trigger policies (log-size threshold, time-based interval, and leadership change).
- Backward-compatible restore on restart.
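
A minimal sketch of how the three trigger policies could be combined into one decision. The class name `SnapshotPolicy`, its fields, and the default thresholds are hypothetical and not taken from the MetadataService code:

```python
from dataclasses import dataclass


@dataclass
class SnapshotPolicy:
    """Hypothetical trigger policy; the defaults are illustrative only."""
    max_log_bytes: int = 64 * 1024 * 1024  # size threshold (assumed value)
    max_interval_s: float = 300.0          # time-based threshold (assumed value)

    def should_snapshot(self, log_bytes: int, last_snapshot_at: float,
                        now: float, leadership_changed: bool = False) -> bool:
        # Any single condition is enough to trigger a snapshot.
        if leadership_changed:
            return True
        if log_bytes >= self.max_log_bytes:
            return True
        return (now - last_snapshot_at) >= self.max_interval_s
```

An on-demand snapshot would bypass this check entirely. Whether a leadership change should always force a snapshot is a design choice left open here.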
Acceptance Criteria
- A node can restart from a snapshot and catch up by replaying the log entries that follow it.
- The log is truncated only after the snapshot has been durably persisted.
- Fuzz/chaos tests show correct recovery with mixed snapshot/log replay.
- Benchmarks show restart time at ≤30% of the no-compaction baseline at 100k ops.
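
The restart and truncation criteria above can be illustrated with a small in-memory model. This is a sketch under assumptions: the log is a dict keyed by index, operations are hypothetical `("put", file, chunks)` / `("delete", file)` tuples, and all names (`Snapshot`, `MetadataState`, `restore`, `truncate_log`) are invented for illustration:

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    last_index: int  # index of the last log entry the snapshot covers
    last_term: int   # term of that entry
    mappings: dict   # file name -> list of chunk ids


@dataclass
class MetadataState:
    mappings: dict = field(default_factory=dict)
    applied_index: int = 0

    def apply(self, index, op):
        # op is a hypothetical ("put", file, chunks) or ("delete", file) tuple.
        if op[0] == "put":
            self.mappings[op[1]] = list(op[2])
        elif op[0] == "delete":
            self.mappings.pop(op[1], None)
        self.applied_index = index


def take_snapshot(state, term):
    copied = {k: list(v) for k, v in state.mappings.items()}
    return Snapshot(state.applied_index, term, copied)


def truncate_log(log, snapshot):
    """Drop entries the snapshot covers; call only once it is durable."""
    return {i: op for i, op in log.items() if i > snapshot.last_index}


def restore(snapshot, log):
    """Load the snapshot (if any), then replay only the newer entries."""
    state = MetadataState()
    if snapshot is not None:
        state.mappings = {k: list(v) for k, v in snapshot.mappings.items()}
        state.applied_index = snapshot.last_index
    for index in sorted(log):
        if index > state.applied_index:
            state.apply(index, log[index])
    return state
```

Because `restore` skips entries at or below `last_index`, it yields the same state whether or not the log has already been truncated. That is the property the mixed snapshot/log replay tests would need to check.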
Notes
Consider a pluggable snapshot store (local filesystem first), atomic writes (temp file + rename), and a CRC checksum to detect corrupted snapshots.
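
A rough sketch of the temp file + rename + CRC idea for a local filesystem store. The function names and the on-disk layout (a 4-byte big-endian CRC32 followed by the payload) are assumptions for illustration, not a proposed final format:

```python
import os
import struct
import tempfile
import zlib


def save_snapshot(path: str, payload: bytes) -> None:
    """Write the CRC32 header and payload to a temp file, then rename it."""
    header = struct.pack(">I", zlib.crc32(payload))
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must be in the same directory so the rename stays atomic.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".snap-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(header + payload)
            f.flush()
            os.fsync(f.fileno())  # make the data durable before renaming
        os.replace(tmp, path)     # atomically swap in the new snapshot
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def load_snapshot(path: str) -> bytes:
    """Read a snapshot and verify its CRC32 before returning the payload."""
    with open(path, "rb") as f:
        data = f.read()
    if len(data) < 4:
        raise ValueError("snapshot too short")
    (expected,) = struct.unpack(">I", data[:4])
    payload = data[4:]
    if zlib.crc32(payload) != expected:
        raise ValueError("snapshot CRC mismatch")
    return payload
```

This sketch does not fsync the parent directory after the rename, which a production store would likely want on POSIX filesystems so that the rename itself survives a crash.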