Conversation
Pull request overview
This PR fixes a crash-on-restart bug where HNSW-mode databases would fail with a vector index mismatch error during WAL replay recovery. The fix cross-references the loaded HNSW index against the replayed state machine to add missing vectors and remove stale ones before the store is finalized.
Changes:
- `build_vector_index` is updated to conditionally skip re-indexing entries already present in the loaded HNSW, and to remove entries from the HNSW that were deleted via WAL replay after the last checkpoint.
- Integration test `test_hnsw_recovery_sync` is added to verify that post-checkpoint insertions survive recovery and are correctly searchable.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `crates/cortexadb-core/src/store.rs` | Core recovery sync logic: skip already-indexed entries and remove stale ones from the HNSW on startup |
| `crates/cortexadb-core/tests/integration.rs` | New integration test exercising HNSW checkpoint + crash + recovery with post-checkpoint insert and delete |
```diff
 let existing_ids: HashSet<MemoryId> = indexes.vector.indexed_ids().into_iter().collect();

 // Add or update missing embeddings from the state machine into the loaded HNSW
 for entry in state_machine.all_memories() {
     if let Some(embedding) = entry.embedding.clone() {
-        indexes.vector_index_mut().index_in_namespace(
-            &entry.namespace,
-            entry.id,
-            embedding,
-        )?;
+        if !existing_ids.contains(&entry.id) {
+            indexes.vector_index_mut().index_in_namespace(
+                &entry.namespace,
+                entry.id,
+                embedding,
+            )?;
+        }
     }
 }

 // Remove IDs from HNSW that are no longer in the state machine (e.g. they were deleted in the replayed WAL)
 if has_loaded_hnsw {
     let state_ids: HashSet<MemoryId> = state_machine
         .all_memories()
         .into_iter()
         .filter(|e| e.embedding.is_some())
         .map(|e| e.id)
         .collect();

     for existing_id in existing_ids {
         if !state_ids.contains(&existing_id) {
             let _ = indexes.vector_index_mut().remove(existing_id);
         }
     }
 }
```
The stale-entry removal block (lines 891–904) is fundamentally ineffective. It relies on existing_ids, which is computed from indexes.vector.indexed_ids() (line 875) immediately after IndexLayer::new_with_loaded_hnsw(...). At that point, VectorIndex's id_to_namespace map is empty (initialized as HashMap::new() in new_with_loaded_hnsw), so indexed_ids() always returns an empty Vec, and existing_ids is always an empty HashSet.
As a result, stale vectors that are present in the on-disk HNSW backend (e.g., entries deleted via WAL replay after the checkpoint) are never removed from the HNSW backend. They will continue to surface in HNSW-based searches as ghost results, which is the exact corruption scenario this PR claims to fix.
The root cause is that indexed_ids() queries id_to_namespace, which only tracks what was added via index_in_namespace/remove, not what is stored inside the underlying hnsw_backend. To correctly enumerate stale HNSW entries, a method would need to be added to HnswBackend to list its keys (if the usearch crate supports that), or the checkpoint save/load process must also save the ID-to-namespace mapping alongside the HNSW binary, so that indexed_ids() can be populated from that mapping on load.
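As a rough illustration of the second option suggested above, the ID-to-namespace mapping could be written as a small sidecar next to the HNSW binary at checkpoint time and rehydrated on load, so that `indexed_ids()` reflects the on-disk contents. A minimal standalone sketch, assuming a hypothetical `u64` `MemoryId` and a simple tab-separated format (not the crate's actual types or checkpoint layout):

```rust
use std::collections::HashMap;

// Hypothetical: MemoryId is modeled as a u64 for this sketch.
type MemoryId = u64;

/// Serialize the id -> namespace map as "id<TAB>namespace" lines.
fn save_id_map(map: &HashMap<MemoryId, String>) -> String {
    let mut lines: Vec<String> = map
        .iter()
        .map(|(id, ns)| format!("{}\t{}", id, ns))
        .collect();
    lines.sort(); // deterministic sidecar contents
    lines.join("\n")
}

/// Rehydrate the map on load so indexed_ids() can be populated from it.
fn load_id_map(data: &str) -> HashMap<MemoryId, String> {
    data.lines()
        .filter_map(|line| {
            let (id, ns) = line.split_once('\t')?;
            Some((id.parse().ok()?, ns.to_string()))
        })
        .collect()
}

fn main() {
    let mut map = HashMap::new();
    map.insert(1u64, "default".to_string());
    map.insert(2u64, "docs".to_string());
    // Round-trip: what was saved at checkpoint time is recovered on load.
    let roundtripped = load_id_map(&save_id_map(&map));
    assert_eq!(roundtripped, map);
    println!("roundtrip ok: {} entries", roundtripped.len());
}
```

With such a sidecar in place, `existing_ids` would be non-empty after a reload, making both the skip optimization and the stale-removal loop effective.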
Because existing_ids is empty before the insertion loop, the "skip already-indexed IDs" optimization in the insertion loop (line 880: if !existing_ids.contains(&entry.id)) is also a no-op on every recovery — all entries are unconditionally re-added, which doubles writes to the HNSW backend (once from disk and once from the re-indexing loop).
```rust
// Perform an HNSW search to ensure the vector index was properly synced during recovery
let hits = db.ask(vec![1.0, 0.0, 0.0], 5, None).unwrap();
assert!(!hits.is_empty());
assert_eq!(hits[0].id, id_target, "top hit should be the post-checkpoint entry");
```
The test does not adequately verify that stale entries (those deleted after the checkpoint but before the crash) are absent from HNSW search results. It only asserts that hits[0].id == id_target, but since the deleted vector [0.0, 0.0, 1.0] is orthogonal to the query [1.0, 0.0, 0.0], it is unlikely to appear as the top hit regardless of whether the stale-removal fix actually works.
To properly validate the stale-removal behavior, the test should either:
- Assert that `id_deleted` does not appear anywhere in the `hits` results, or
- Use a query vector that is closer to `id_deleted`'s vector than to the others, so the test would fail if the stale entry contaminates search results.
```diff
 assert_eq!(hits[0].id, id_target, "top hit should be the post-checkpoint entry");
+assert!(
+    hits.iter().all(|h| h.id != id_deleted),
+    "deleted entry should not appear in HNSW search results"
+);
```
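The orthogonality point above is easy to confirm numerically. This standalone sketch (not part of the PR; `id_target`'s vector is assumed to match the query) shows that the deleted vector has exactly zero dot-product similarity with the query, so it could never win the top-hit slot even if the stale-removal fix were broken:

```rust
/// Dot product of two equal-length vectors.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    let query = [1.0f32, 0.0, 0.0];   // the test's query vector
    let deleted = [0.0f32, 0.0, 1.0]; // the deleted entry's vector
    let target = [1.0f32, 0.0, 0.0];  // id_target's vector (assumed)
    // Orthogonal vectors have zero similarity, so `deleted` cannot outrank `target`.
    assert_eq!(dot(&query, &deleted), 0.0);
    assert!(dot(&query, &target) > dot(&query, &deleted));
    println!("query . deleted = {}", dot(&query, &deleted));
}
```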
```diff
 for entry in state_machine.all_memories() {
     if let Some(embedding) = entry.embedding.clone() {
-        indexes.vector_index_mut().index_in_namespace(
-            &entry.namespace,
-            entry.id,
-            embedding,
-        )?;
+        if !existing_ids.contains(&entry.id) {
+            indexes.vector_index_mut().index_in_namespace(
+                &entry.namespace,
+                entry.id,
+                embedding,
+            )?;
+        }
     }
 }

 // Remove IDs from HNSW that are no longer in the state machine (e.g. they were deleted in the replayed WAL)
 if has_loaded_hnsw {
     let state_ids: HashSet<MemoryId> = state_machine
         .all_memories()
         .into_iter()
         .filter(|e| e.embedding.is_some())
         .map(|e| e.id)
         .collect();
```
state_machine.all_memories() is called twice in build_vector_index when has_loaded_hnsw is true: once at line 878 and once at line 893. Each call allocates a new Vec, copies all entry references, and sorts them in O(n log n). These two calls could be merged into a single iteration that simultaneously collects state_ids and performs the insertion loop, eliminating one full traversal and allocation. At startup with a large database, this doubles the cost of what should be a one-pass operation.
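A sketch of the suggested single-pass merge, using hypothetical stand-in types rather than the crate's real `Entry`/`MemoryId`/index API:

```rust
use std::collections::HashSet;

// Hypothetical stand-ins for the crate's types.
type MemoryId = u64;

struct Entry {
    id: MemoryId,
    embedding: Option<Vec<f32>>,
}

/// One traversal of the entries: collect the IDs that should remain in the
/// HNSW while deciding which entries need (re-)indexing, instead of calling
/// all_memories() twice.
fn single_pass(
    entries: &[Entry],
    existing_ids: &HashSet<MemoryId>,
) -> (Vec<MemoryId>, HashSet<MemoryId>) {
    let mut to_index = Vec::new();
    let mut state_ids = HashSet::new();
    for entry in entries {
        if entry.embedding.is_some() {
            state_ids.insert(entry.id); // feeds the later stale-removal loop
            if !existing_ids.contains(&entry.id) {
                to_index.push(entry.id); // would call index_in_namespace here
            }
        }
    }
    (to_index, state_ids)
}

fn main() {
    let entries = vec![
        Entry { id: 1, embedding: Some(vec![0.1]) },
        Entry { id: 2, embedding: None },
        Entry { id: 3, embedding: Some(vec![0.2]) },
    ];
    let existing: HashSet<_> = [1].into();
    let (to_index, state_ids) = single_pass(&entries, &existing);
    assert_eq!(to_index, vec![3]);
    assert!(state_ids.contains(&1) && !state_ids.contains(&2));
}
```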
```rust
// HNSW Recovery sync tests
// ---------------------------------------------------------------------------

#[test]
```
All other integration tests that call db.checkpoint() are annotated with #[serial] (see test_checkpoint_recovery_preserves_all_entries, test_double_checkpoint_recovery, and test_delete_then_checkpoint_recovery). The new test calls db.checkpoint() but does not have #[serial]. While each test uses its own TempDir, the convention in this file is to mark all checkpoint-involving tests with #[serial].
```diff
 #[test]
+#[serial]
```
Description
Fixes a severe bug where the database crashes on restart with a `vector index mismatch` error when recovering from a crash. During recovery, the engine replays WAL entries, but the HNSW index loaded from disk naturally doesn't have the uncheckpointed vectors. This PR cross-references the loaded HNSW index against the replayed state machine, and dynamically adds missing vectors / removes stale vectors before finalizing the store open.
Changes
- Updated `CortexaDBStore::build_vector_index` to hydrate missing vectors and remove stale ones.
- Added `test_hnsw_recovery_sync()` to the integration tests.
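The cross-referencing step described above amounts to a set reconciliation between the on-disk index and the replayed state machine. A minimal standalone sketch (hypothetical `u64` IDs, not the crate's API):

```rust
use std::collections::HashSet;

// Hypothetical: MemoryId modeled as u64.
type MemoryId = u64;

/// Given the IDs present in the loaded HNSW and the IDs the replayed state
/// machine says should exist, compute what must be added and removed.
fn reconcile(
    on_disk: &HashSet<MemoryId>,
    replayed: &HashSet<MemoryId>,
) -> (Vec<MemoryId>, Vec<MemoryId>) {
    let to_add = replayed.difference(on_disk).copied().collect();
    let to_remove = on_disk.difference(replayed).copied().collect();
    (to_add, to_remove)
}

fn main() {
    // Checkpoint contained {1, 2}; after it, 3 was inserted and 2 deleted.
    let on_disk: HashSet<_> = [1, 2].into();
    let replayed: HashSet<_> = [1, 3].into();
    let (mut add, mut rm) = reconcile(&on_disk, &replayed);
    add.sort();
    rm.sort();
    assert_eq!(add, vec![3]); // post-checkpoint insert must be re-indexed
    assert_eq!(rm, vec![2]);  // post-checkpoint delete must be purged
    println!("add {:?}, remove {:?}", add, rm);
}
```

Note that this only works in practice if the set of on-disk IDs can actually be enumerated after a reload, which is the gap the first review comment identifies.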