
fix(core): sync missing vectors from state machine to HNSW on recovery#11

Merged

anaslimem merged 1 commit into main from fix_hnsw on Mar 4, 2026

Conversation

@anaslimem (Owner)

Description

Fixes a severe bug where, when recovering from a crash, the database fails on restart with a vector index mismatch error.

During recovery, the engine replays WAL entries, but the HNSW index loaded from disk naturally doesn't have the uncheckpointed vectors. This PR cross-references the loaded HNSW index against the replayed State Machine, and dynamically adds missing vectors / removes stale vectors before finalizing the store open.

Changes

  • Updated CortexaDBStore::build_vector_index to hydrate missing vectors and remove stale ones.
  • Added test_hnsw_recovery_sync() to integration tests.

Copilot AI review requested due to automatic review settings March 4, 2026 13:32
@anaslimem anaslimem merged commit 5c613ba into main Mar 4, 2026
3 checks passed

Copilot AI left a comment

Pull request overview

This PR fixes a crash-on-restart bug where HNSW-mode databases would fail with a vector index mismatch error during WAL replay recovery. The fix cross-references the loaded HNSW index against the replayed state machine to add missing vectors and remove stale ones before the store is finalized.

Changes:

  • build_vector_index is updated to conditionally skip re-indexing entries already present in the loaded HNSW, and to remove entries from HNSW that were deleted via WAL replay after the last checkpoint.
  • Integration test test_hnsw_recovery_sync is added to verify that post-checkpoint insertions survive recovery and are correctly searchable.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File                                        Description
crates/cortexadb-core/src/store.rs          Core recovery sync logic: skip already-indexed entries and remove stale ones from the HNSW on startup
crates/cortexadb-core/tests/integration.rs  New integration test exercising HNSW checkpoint + crash + recovery with post-checkpoint insert and delete


Comment on lines +875 to +903
+let existing_ids: HashSet<MemoryId> = indexes.vector.indexed_ids().into_iter().collect();
+
 // Add or update missing embeddings from the state machine into the loaded HNSW
 for entry in state_machine.all_memories() {
     if let Some(embedding) = entry.embedding.clone() {
-        indexes.vector_index_mut().index_in_namespace(
-            &entry.namespace,
-            entry.id,
-            embedding,
-        )?;
+        if !existing_ids.contains(&entry.id) {
+            indexes.vector_index_mut().index_in_namespace(
+                &entry.namespace,
+                entry.id,
+                embedding,
+            )?;
+        }
     }
 }
+
+// Remove IDs from HNSW that are no longer in the state machine (e.g. they were deleted in the replayed WAL)
+if has_loaded_hnsw {
+    let state_ids: HashSet<MemoryId> = state_machine
+        .all_memories()
+        .into_iter()
+        .filter(|e| e.embedding.is_some())
+        .map(|e| e.id)
+        .collect();
+
+    for existing_id in existing_ids {
+        if !state_ids.contains(&existing_id) {
+            let _ = indexes.vector_index_mut().remove(existing_id);
+        }
+    }

Copilot AI Mar 4, 2026

The stale-entry removal block (lines 891–904) is fundamentally ineffective. It relies on existing_ids, which is computed from indexes.vector.indexed_ids() (line 875) immediately after IndexLayer::new_with_loaded_hnsw(...). At that point, VectorIndex's id_to_namespace map is empty (initialized as HashMap::new() in new_with_loaded_hnsw), so indexed_ids() always returns an empty Vec, and existing_ids is always an empty HashSet.

As a result, stale vectors that are present in the on-disk HNSW backend (e.g., entries deleted via WAL replay after the checkpoint) are never removed from the HNSW backend. They will continue to surface in HNSW-based searches as ghost results, which is the exact corruption scenario this PR claims to fix.

The root cause is that indexed_ids() queries id_to_namespace, which only tracks what was added via index_in_namespace/remove, not what is stored inside the underlying hnsw_backend. To correctly enumerate stale HNSW entries, a method would need to be added to HnswBackend to list its keys (if the usearch crate supports that), or the checkpoint save/load process must also save the ID-to-namespace mapping alongside the HNSW binary, so that indexed_ids() can be populated from that mapping on load.

Because existing_ids is empty before the insertion loop, the "skip already-indexed IDs" optimization in the insertion loop (line 880: if !existing_ids.contains(&entry.id)) is also a no-op on every recovery — all entries are unconditionally re-added, which doubles writes to the HNSW backend (once from disk and once from the re-indexing loop).

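One way to make indexed_ids() meaningful after a reload, as the comment suggests, is to persist the ID-to-namespace mapping alongside the HNSW binary at checkpoint time and rehydrate it on load. A minimal, self-contained sketch of that idea, assuming MemoryId is a u64 and using a hypothetical tab-separated on-disk format (the save_id_map/load_id_map helpers are illustrative, not part of the codebase):

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical stand-in: MemoryId is assumed to be a u64 for illustration.
type MemoryId = u64;

// Sketch: serialize VectorIndex's id_to_namespace map next to the HNSW
// binary at checkpoint time, one "id\tnamespace" record per line.
fn save_id_map(map: &HashMap<MemoryId, String>) -> Vec<u8> {
    let mut out = Vec::new();
    for (id, ns) in map {
        out.extend_from_slice(format!("{}\t{}\n", id, ns).as_bytes());
    }
    out
}

// On load, repopulate the map so indexed_ids() reflects the on-disk HNSW.
fn load_id_map(bytes: &[u8]) -> HashMap<MemoryId, String> {
    String::from_utf8_lossy(bytes)
        .lines()
        .filter_map(|line| {
            let (id, ns) = line.split_once('\t')?;
            Some((id.parse().ok()?, ns.to_string()))
        })
        .collect()
}

fn main() {
    // Checkpoint: two vectors indexed under the "default" namespace.
    let mut id_map = HashMap::new();
    id_map.insert(1, "default".to_string());
    id_map.insert(2, "default".to_string());
    let saved = save_id_map(&id_map);

    // Recovery: the reloaded map yields real existing_ids, so the
    // stale-removal pass can now see entry 2 even though id_to_namespace
    // starts empty in new_with_loaded_hnsw.
    let loaded = load_id_map(&saved);
    let existing_ids: HashSet<MemoryId> = loaded.keys().copied().collect();
    let state_ids: HashSet<MemoryId> = [1].into_iter().collect(); // entry 2 deleted via WAL
    let stale: Vec<MemoryId> = existing_ids.difference(&state_ids).copied().collect();
    assert_eq!(stale, vec![2]);
    println!("stale ids to remove from HNSW: {:?}", stale);
}
```

If the usearch crate exposes a way to enumerate stored keys, listing them directly from HnswBackend would avoid a second on-disk file, but persisting the mapping keeps the namespace information that key enumeration alone cannot recover.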
// Perform an HNSW search to ensure the vector index was properly synced during recovery
let hits = db.ask(vec![1.0, 0.0, 0.0], 5, None).unwrap();
assert!(!hits.is_empty());
assert_eq!(hits[0].id, id_target, "top hit should be the post-checkpoint entry");

Copilot AI Mar 4, 2026

The test does not adequately verify that stale entries (those deleted after the checkpoint but before the crash) are absent from HNSW search results. It only asserts that hits[0].id == id_target, but since the deleted vector [0.0, 0.0, 1.0] is orthogonal to the query [1.0, 0.0, 0.0], it is unlikely to appear as the top hit regardless of whether the stale-removal fix actually works.

To properly validate the stale-removal behavior, the test should either:

  • Assert that id_deleted does not appear anywhere in the hits results, or
  • Use a query vector that is closer to id_deleted's vector than to the others, so the test would fail if the stale entry contaminates search results.
Suggested change

 assert_eq!(hits[0].id, id_target, "top hit should be the post-checkpoint entry");
+assert!(
+    hits.iter().all(|h| h.id != id_deleted),
+    "deleted entry should not appear in HNSW search results"
+);

Comment on lines 878 to +897
 for entry in state_machine.all_memories() {
     if let Some(embedding) = entry.embedding.clone() {
-        indexes.vector_index_mut().index_in_namespace(
-            &entry.namespace,
-            entry.id,
-            embedding,
-        )?;
+        if !existing_ids.contains(&entry.id) {
+            indexes.vector_index_mut().index_in_namespace(
+                &entry.namespace,
+                entry.id,
+                embedding,
+            )?;
+        }
     }
 }
+
+// Remove IDs from HNSW that are no longer in the state machine (e.g. they were deleted in the replayed WAL)
+if has_loaded_hnsw {
+    let state_ids: HashSet<MemoryId> = state_machine
+        .all_memories()
+        .into_iter()
+        .filter(|e| e.embedding.is_some())
+        .map(|e| e.id)
+        .collect();

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

state_machine.all_memories() is called twice in build_vector_index when has_loaded_hnsw is true: once at line 878 and once at line 893. Each call allocates a new Vec, copies all entry references, and sorts them in O(n log n). These two calls could be merged into a single iteration that simultaneously collects state_ids and performs the insertion loop, eliminating one full traversal and allocation. At startup with a large database, this doubles the cost of what should be a one-pass operation.

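The two traversals could be merged along these lines. A minimal sketch with hypothetical stand-in types (MemoryId as u64, a bare Entry struct); it returns the IDs to index and to remove rather than mutating a real index, but the single-pass shape is the point:

```rust
use std::collections::HashSet;

// Hypothetical stand-ins for illustration; the real types live in store.rs.
type MemoryId = u64;
struct Entry {
    id: MemoryId,
    embedding: Option<Vec<f32>>,
}

// Single pass over all_memories(): collect state_ids while deciding which
// entries need indexing, instead of traversing the memories twice.
fn sync_ids(
    memories: &[Entry],
    existing_ids: &HashSet<MemoryId>,
) -> (Vec<MemoryId>, Vec<MemoryId>) {
    let mut state_ids = HashSet::new();
    let mut to_index = Vec::new();
    for entry in memories {
        if entry.embedding.is_some() {
            state_ids.insert(entry.id);
            if !existing_ids.contains(&entry.id) {
                to_index.push(entry.id); // real code would call index_in_namespace here
            }
        }
    }
    // Stale: present in the loaded HNSW but absent from the state machine.
    let mut to_remove: Vec<MemoryId> = existing_ids
        .iter()
        .copied()
        .filter(|id| !state_ids.contains(id))
        .collect();
    to_remove.sort();
    (to_index, to_remove)
}

fn main() {
    let memories = vec![
        Entry { id: 1, embedding: Some(vec![1.0]) },
        Entry { id: 2, embedding: Some(vec![0.5]) },
        Entry { id: 3, embedding: None }, // no vector: never indexed
    ];
    let existing: HashSet<MemoryId> = [1, 9].into_iter().collect();
    let (to_index, to_remove) = sync_ids(&memories, &existing);
    assert_eq!(to_index, vec![2]);  // missing from the loaded HNSW
    assert_eq!(to_remove, vec![9]); // stale: in HNSW but not in the state machine
}
```

This keeps the recovery sync at one O(n) traversal plus set lookups, which matters at startup on a large database.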
// HNSW Recovery sync tests
// ---------------------------------------------------------------------------

#[test]

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All other integration tests that call db.checkpoint() are annotated with #[serial] (see test_checkpoint_recovery_preserves_all_entries, test_double_checkpoint_recovery, and test_delete_then_checkpoint_recovery). The new test calls db.checkpoint() but does not have #[serial]. While each test uses its own TempDir, the convention in this file is to mark all checkpoint-involving tests with #[serial].

Suggested change

 #[test]
+#[serial]