
fix(core): sync missing vectors from state machine to HNSW on recovery#11

Merged

anaslimem merged 1 commit into main from fix_hnsw on Mar 4, 2026

Conversation

@anaslimem (Owner)

Description

Fixes a severe bug where, when recovering from a crash, the database fails on restart with a vector index mismatch error.

During recovery, the engine replays WAL entries, but the HNSW index loaded from disk naturally doesn't have the uncheckpointed vectors. This PR cross-references the loaded HNSW index against the replayed State Machine, and dynamically adds missing vectors / removes stale vectors before finalizing the store open.

Changes

  • Updated CortexaDBStore::build_vector_index to hydrate missing vectors and remove stale ones.
  • Added test_hnsw_recovery_sync() to integration tests.

Copilot AI review requested due to automatic review settings March 4, 2026 13:32
@anaslimem anaslimem merged commit 5c613ba into main Mar 4, 2026
3 checks passed

Copilot AI left a comment

Pull request overview

This PR fixes a crash-on-restart bug where HNSW-mode databases would fail with a vector index mismatch error during WAL replay recovery. The fix cross-references the loaded HNSW index against the replayed state machine to add missing vectors and remove stale ones before the store is finalized.

Changes:

  • build_vector_index is updated to conditionally skip re-indexing entries already present in the loaded HNSW, and to remove entries from HNSW that were deleted via WAL replay after the last checkpoint.
  • Integration test test_hnsw_recovery_sync is added to verify that post-checkpoint insertions survive recovery and are correctly searchable.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File                                        Description
crates/cortexadb-core/src/store.rs          Core recovery sync logic: skip already-indexed entries and remove stale ones from the HNSW on startup
crates/cortexadb-core/tests/integration.rs  New integration test exercising HNSW checkpoint + crash + recovery with post-checkpoint insert and delete


Comment on lines +875 to +903
+let existing_ids: HashSet<MemoryId> = indexes.vector.indexed_ids().into_iter().collect();
+
 // Add or update missing embeddings from the state machine into the loaded HNSW
 for entry in state_machine.all_memories() {
     if let Some(embedding) = entry.embedding.clone() {
-        indexes.vector_index_mut().index_in_namespace(
-            &entry.namespace,
-            entry.id,
-            embedding,
-        )?;
+        if !existing_ids.contains(&entry.id) {
+            indexes.vector_index_mut().index_in_namespace(
+                &entry.namespace,
+                entry.id,
+                embedding,
+            )?;
+        }
     }
 }
+
+// Remove IDs from HNSW that are no longer in the state machine (e.g. they were deleted in the replayed WAL)
+if has_loaded_hnsw {
+    let state_ids: HashSet<MemoryId> = state_machine
+        .all_memories()
+        .into_iter()
+        .filter(|e| e.embedding.is_some())
+        .map(|e| e.id)
+        .collect();
+
+    for existing_id in existing_ids {
+        if !state_ids.contains(&existing_id) {
+            let _ = indexes.vector_index_mut().remove(existing_id);
+        }
+    }

Copilot AI Mar 4, 2026

The stale-entry removal block (lines 891–904) is fundamentally ineffective. It relies on existing_ids, which is computed from indexes.vector.indexed_ids() (line 875) immediately after IndexLayer::new_with_loaded_hnsw(...). At that point, VectorIndex's id_to_namespace map is empty (initialized as HashMap::new() in new_with_loaded_hnsw), so indexed_ids() always returns an empty Vec, and existing_ids is always an empty HashSet.

As a result, stale vectors that are present in the on-disk HNSW backend (e.g., entries deleted via WAL replay after the checkpoint) are never removed from the HNSW backend. They will continue to surface in HNSW-based searches as ghost results, which is the exact corruption scenario this PR claims to fix.

The root cause is that indexed_ids() queries id_to_namespace, which only tracks what was added via index_in_namespace/remove, not what is stored inside the underlying hnsw_backend. To correctly enumerate stale HNSW entries, a method would need to be added to HnswBackend to list its keys (if the usearch crate supports that), or the checkpoint save/load process must also save the ID-to-namespace mapping alongside the HNSW binary, so that indexed_ids() can be populated from that mapping on load.

Because existing_ids is empty before the insertion loop, the "skip already-indexed IDs" optimization in the insertion loop (line 880: if !existing_ids.contains(&entry.id)) is also a no-op on every recovery — all entries are unconditionally re-added, which doubles writes to the HNSW backend (once from disk and once from the re-indexing loop).

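One way to make indexed_ids() meaningful after a reload, as the comment suggests, is to persist the ID-to-namespace mapping alongside the HNSW binary at checkpoint time and rehydrate it on load. A minimal, self-contained sketch of that idea, assuming MemoryId is a u64 and using a hypothetical tab-separated on-disk format (the save_id_map/load_id_map helpers are illustrative, not part of the codebase):

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical stand-in: MemoryId is assumed to be a u64 for illustration.
type MemoryId = u64;

// Sketch: serialize VectorIndex's id_to_namespace map next to the HNSW
// binary at checkpoint time, one "id\tnamespace" record per line.
fn save_id_map(map: &HashMap<MemoryId, String>) -> Vec<u8> {
    let mut out = Vec::new();
    for (id, ns) in map {
        out.extend_from_slice(format!("{}\t{}\n", id, ns).as_bytes());
    }
    out
}

// On load, repopulate the map so indexed_ids() reflects the on-disk HNSW.
fn load_id_map(bytes: &[u8]) -> HashMap<MemoryId, String> {
    String::from_utf8_lossy(bytes)
        .lines()
        .filter_map(|line| {
            let (id, ns) = line.split_once('\t')?;
            Some((id.parse().ok()?, ns.to_string()))
        })
        .collect()
}

fn main() {
    // Checkpoint: two vectors indexed under the "default" namespace.
    let mut id_map = HashMap::new();
    id_map.insert(1, "default".to_string());
    id_map.insert(2, "default".to_string());
    let saved = save_id_map(&id_map);

    // Recovery: the reloaded map yields real existing_ids, so the
    // stale-removal pass can now see entry 2 even though id_to_namespace
    // starts empty in new_with_loaded_hnsw.
    let loaded = load_id_map(&saved);
    let existing_ids: HashSet<MemoryId> = loaded.keys().copied().collect();
    let state_ids: HashSet<MemoryId> = [1].into_iter().collect(); // entry 2 deleted via WAL
    let stale: Vec<MemoryId> = existing_ids.difference(&state_ids).copied().collect();
    assert_eq!(stale, vec![2]);
    println!("stale ids to remove from HNSW: {:?}", stale);
}
```

If the usearch crate exposes a way to enumerate stored keys, listing them directly from HnswBackend would avoid a second on-disk file, but persisting the mapping keeps the namespace information that key enumeration alone cannot recover.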
// Perform an HNSW search to ensure the vector index was properly synced during recovery
let hits = db.ask(vec![1.0, 0.0, 0.0], 5, None).unwrap();
assert!(!hits.is_empty());
assert_eq!(hits[0].id, id_target, "top hit should be the post-checkpoint entry");

Copilot AI Mar 4, 2026

The test does not adequately verify that stale entries (those deleted after the checkpoint but before the crash) are absent from HNSW search results. It only asserts that hits[0].id == id_target, but since the deleted vector [0.0, 0.0, 1.0] is orthogonal to the query [1.0, 0.0, 0.0], it is unlikely to appear as the top hit regardless of whether the stale-removal fix actually works.

To properly validate the stale-removal behavior, the test should either:

  • Assert that id_deleted does not appear anywhere in the hits results, or
  • Use a query vector that is closer to id_deleted's vector than to the others, so the test would fail if the stale entry contaminates search results.
Suggested change

 assert_eq!(hits[0].id, id_target, "top hit should be the post-checkpoint entry");
+assert!(
+    hits.iter().all(|h| h.id != id_deleted),
+    "deleted entry should not appear in HNSW search results"
+);

Comment on lines 878 to +897
 for entry in state_machine.all_memories() {
     if let Some(embedding) = entry.embedding.clone() {
-        indexes.vector_index_mut().index_in_namespace(
-            &entry.namespace,
-            entry.id,
-            embedding,
-        )?;
+        if !existing_ids.contains(&entry.id) {
+            indexes.vector_index_mut().index_in_namespace(
+                &entry.namespace,
+                entry.id,
+                embedding,
+            )?;
+        }
     }
 }
+
+// Remove IDs from HNSW that are no longer in the state machine (e.g. they were deleted in the replayed WAL)
+if has_loaded_hnsw {
+    let state_ids: HashSet<MemoryId> = state_machine
+        .all_memories()
+        .into_iter()
+        .filter(|e| e.embedding.is_some())
+        .map(|e| e.id)
+        .collect();

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

state_machine.all_memories() is called twice in build_vector_index when has_loaded_hnsw is true: once at line 878 and once at line 893. Each call allocates a new Vec, copies all entry references, and sorts them in O(n log n). These two calls could be merged into a single iteration that simultaneously collects state_ids and performs the insertion loop, eliminating one full traversal and allocation. At startup with a large database, this doubles the cost of what should be a one-pass operation.

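The two traversals could be merged along these lines. A minimal sketch with hypothetical stand-in types (MemoryId as u64, a bare Entry struct); it returns the IDs to index and to remove rather than mutating a real index, but the single-pass shape is the point:

```rust
use std::collections::HashSet;

// Hypothetical stand-ins for illustration; the real types live in store.rs.
type MemoryId = u64;
struct Entry {
    id: MemoryId,
    embedding: Option<Vec<f32>>,
}

// Single pass over all_memories(): collect state_ids while deciding which
// entries need indexing, instead of traversing the memories twice.
fn sync_ids(
    memories: &[Entry],
    existing_ids: &HashSet<MemoryId>,
) -> (Vec<MemoryId>, Vec<MemoryId>) {
    let mut state_ids = HashSet::new();
    let mut to_index = Vec::new();
    for entry in memories {
        if entry.embedding.is_some() {
            state_ids.insert(entry.id);
            if !existing_ids.contains(&entry.id) {
                to_index.push(entry.id); // real code would call index_in_namespace here
            }
        }
    }
    // Stale: present in the loaded HNSW but absent from the state machine.
    let mut to_remove: Vec<MemoryId> = existing_ids
        .iter()
        .copied()
        .filter(|id| !state_ids.contains(id))
        .collect();
    to_remove.sort();
    (to_index, to_remove)
}

fn main() {
    let memories = vec![
        Entry { id: 1, embedding: Some(vec![1.0]) },
        Entry { id: 2, embedding: Some(vec![0.5]) },
        Entry { id: 3, embedding: None }, // no vector: never indexed
    ];
    let existing: HashSet<MemoryId> = [1, 9].into_iter().collect();
    let (to_index, to_remove) = sync_ids(&memories, &existing);
    assert_eq!(to_index, vec![2]);  // missing from the loaded HNSW
    assert_eq!(to_remove, vec![9]); // stale: in HNSW but not in the state machine
}
```

This keeps the recovery sync at one O(n) traversal plus set lookups, which matters at startup on a large database.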
// HNSW Recovery sync tests
// ---------------------------------------------------------------------------

#[test]

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All other integration tests that call db.checkpoint() are annotated with #[serial] (see test_checkpoint_recovery_preserves_all_entries, test_double_checkpoint_recovery, and test_delete_then_checkpoint_recovery). The new test calls db.checkpoint() but does not have #[serial]. While each test uses its own TempDir, the convention in this file is to mark all checkpoint-involving tests with #[serial].

Suggested change

 #[test]
+#[serial]