
Improve performance of delete_batch #2999

Open
G-D-Petrov wants to merge 11 commits into master from gpetrov/improve_delete_batch_perf

Conversation

Collaborator

@G-D-Petrov G-D-Petrov commented Mar 31, 2026

Reference Issues/PRs

Monday ticket ref: 11637143232

What does this implement or fix?

ArcticDB delete_batch Performance Optimization

Problem

delete_batch was up to 2x slower than a naive parallel multiprocessing approach (one process per symbol deletion).

Root Causes

Three sequential bottlenecks in the delete_trees_responsibly code path:

  1. Sequential check_reload -- version map reload for each symbol was done in a loop, one at a time
  2. Sequential recurse_index_keys -- index segments were read one at a time to discover data keys
  3. Sequential remove_keys -- S3 DeleteObjects calls were batched by key type (3 groups), chained sequentially via .thenValue

Changes

1. Parallel version map reload (local_versioned_engine.cpp)

Replaced the sequential version_map->check_reload() loop with parallel IO tasks, submitting a CheckReloadTask per unique symbol to the IO thread pool.
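The per-symbol fan-out can be sketched in Python with concurrent.futures as an analogy to submitting a CheckReloadTask per unique symbol and waiting on folly::collectAll; check_reload here is a hypothetical stand-in for the real version-map call:

```python
from concurrent.futures import ThreadPoolExecutor

def check_reload(symbol):
    # Stand-in for version_map->check_reload(); returns the reloaded entry.
    return f"entry-for-{symbol}"

def parallel_check_reload(symbols, io_pool):
    # Deduplicate: one reload task per unique symbol, preserving order.
    unique = list(dict.fromkeys(symbols))
    futures = {sym: io_pool.submit(check_reload, sym) for sym in unique}
    # The analogue of folly::collectAll(...).get(): wait for every task.
    return {sym: fut.result() for sym, fut in futures.items()}

with ThreadPoolExecutor(max_workers=4) as pool:
    entries = parallel_check_reload(["a", "b", "a", "c"], pool)
```

The dedup step matters: repeated symbols in the batch should trigger only one reload each.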

2. Parallel index key recursion (key_utils.hpp) - the main change, responsible for the ~2x improvement

Changed recurse_index_keys to read all index segments concurrently, then process the segments sequentially once all reads complete.
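The read-all-concurrently, process-sequentially shape can be illustrated with a small Python sketch (read_index_segment is a hypothetical stand-in for the storage read of one index segment):

```python
from concurrent.futures import ThreadPoolExecutor

def read_index_segment(index_key):
    # Stand-in for reading one index segment to discover its data keys.
    return [f"{index_key}/data-{i}" for i in range(3)]

def recurse_index_keys(index_keys, io_pool):
    # Phase 1: issue all segment reads concurrently.
    read_futures = [io_pool.submit(read_index_segment, k) for k in index_keys]
    results = [f.result() for f in read_futures]  # collectAll(...).get()
    # Phase 2: process segments sequentially, merging into one result set.
    data_keys = set()
    for segment in results:
        data_keys.update(segment)
    return data_keys

with ThreadPoolExecutor(max_workers=8) as pool:
    keys = recurse_index_keys(["idx-A", "idx-B"], pool)
```

Because the merge runs on the calling thread after all reads complete, the result set needs no locking.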

3. Parallel key deletion (local_versioned_engine.cpp)

Replaced the sequential .thenValue chain of 3 remove_keys calls (column_stats, index, data) so that all 3 run concurrently:

  • Each remove_keys submits a RemoveBatchTask to a separate IO thread
  • Each group contains only one key type, so do_remove_impl's groupBy has a single iteration
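A Python analogy for the three concurrent removal groups, with per-group errors re-raised rather than swallowed (remove_keys is a hypothetical stand-in for store->remove_keys):

```python
from concurrent.futures import ThreadPoolExecutor

def remove_keys(keys):
    # Stand-in for store->remove_keys(); one S3 DeleteObjects per key group.
    return len(keys)

def parallel_remove(column_stats, index, data, io_pool):
    # Launch all three groups concurrently instead of chaining them.
    futs = [io_pool.submit(remove_keys, g) for g in (column_stats, index, data)]
    # .result() re-raises any stored exception, so no failure is swallowed
    # (the analogue of calling throwUnlessValue() on each Try<>).
    return [f.result() for f in futs]

with ThreadPoolExecutor(max_workers=3) as pool:
    counts = parallel_remove(["cs1"], ["i1", "i2"], ["d1", "d2", "d3"], pool)
```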

Benchmark Results

delete_batch vs Parallel Multiprocessing

Symbols | delete_batch  | Parallel (N procs) | Winner
50      | 0.96s (52/s)  | 1.09s (46/s)       | delete_batch, 1.14x faster
100     | 1.48s (67/s)  | 2.95s (34/s)       | delete_batch, 1.99x faster
1000    | 3.88s (257/s) | (did not complete) | delete_batch wins

All results are from testing on an EC2 machine against an S3 bucket in the same region, with the number of CPU threads set to (Number of Symbols * Number of CPU Cores).

Any other comments?

After the change, delete_batch has about the same performance as write_batch.

Checklist

Checklist for code changes...
  • Have you updated the relevant docstrings, documentation and copyright notice?
  • Is this contribution tested against all ArcticDB's features?
  • Do all exceptions introduced raise appropriate error messages?
  • Are API changes highlighted in the PR description?
  • Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?

@G-D-Petrov G-D-Petrov added the patch Small change, should increase patch version label Mar 31, 2026
@G-D-Petrov G-D-Petrov marked this pull request as ready for review April 2, 2026 07:05
}
}

auto read_results = folly::collectAll(read_futures).get();
Contributor

The parallel reads in recurse_index_keys only parallelize the top-level index reads. When an input key is a MULTI_KEY, processing its segment at line 225 calls recurse_segment, which in turn calls recurse_index_key synchronously (via store->read_sync) for each nested TABLE_INDEX or MULTI_KEY found inside the multi-key segment. This means nested index reads remain sequential and block the calling thread (which may be the thread waiting on folly::collectAll(...).get()).

For typical delete_batch workloads with only TABLE_INDEX keys this is not a problem, but for snapshots or multi-key chains the speedup is incomplete, and deep nesting could block the thread pool for an extended period. This is worth a comment to set expectations, and potentially a future-work note.

log::version().debug("Data keys deleted.");
return folly::Unit();
});
remove_keys_fut = folly::collectAll(
Contributor

Deleting column-stats keys, index keys, and data keys in parallel removes the previous ordering guarantee. The original sequential chain (column_stats → index → data) was presumably intentional: if a crash or partial failure occurs mid-delete, a reader that finds an index key can still reconstruct the data. With parallel deletion, it is possible for data keys to be removed while the index key still exists, leaving a dangling reference.

ArcticDB's ignores_missing_key_ flag on reads (read_opts) provides some crash-tolerance, but the new parallel ordering is a subtle correctness trade-off worth documenting. At minimum, add a comment here explaining why parallel deletion is safe (or acceptable) in this context.

Also note that folly::collectAll returns all results regardless of individual failures. If one of the three remove_keys calls fails, its error is captured in the corresponding Try<>, but the .thenValue lambda discards them all with auto&&. Errors from individual removal groups will be silently swallowed — the caller will see a successful folly::Unit. The previous sequential chain at least propagated the first failure. Consider checking each result in the .thenValue lambda:

Suggested change:

remove_keys_fut = folly::collectAll(
        store->remove_keys(std::move(vks_column_stats), remove_opts),
        store->remove_keys(std::move(vks_to_delete), remove_opts),
        store->remove_keys(std::move(vks_data_to_delete), remove_opts)
    )
    .via(&async::io_executor())
    .thenValue([](auto&& results) {
        // Re-throw the first error, if any
        std::get<0>(results).throwUnlessValue();
        std::get<1>(results).throwUnlessValue();
        std::get<2>(results).throwUnlessValue();
        return folly::Unit();
    });

async::submit_io_task(CheckReloadTask{store, version_map, min.first, load_strategy})
);
}
auto results = folly::collectAll(reload_futures).get();
Contributor

folly::collectAll(...).get() is called here on the calling thread. If this code is already executing inside a Folly thread pool callback (e.g., an IO or CPU task), calling .get() blocks that thread until all CheckReloadTask futures complete. This is the classic Folly deadlock pattern when the thread pool is saturated: all threads are blocked waiting for tasks that are queued but can never start.

The existing pattern elsewhere in version_map_batch_methods.hpp avoids .get() inside task chains. Please verify this call site is always reached from a non-pool thread (e.g., Python-side call path), or restructure to avoid the blocking .get() if it can be reached from within a pool thread.
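The saturation deadlock described above can be reproduced with a one-thread pool in Python: the inner task is queued but can never start while the outer task blocks on it. This sketch uses result(timeout=...) so the demonstration terminates instead of hanging forever as a real .get() would:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutTimeout

pool = ThreadPoolExecutor(max_workers=1)

def outer():
    # Runs on the pool's only thread, then blocks waiting on work that is
    # queued behind it -- the classic saturated-pool deadlock.
    inner = pool.submit(lambda: 42)
    try:
        return inner.result(timeout=0.5)  # a blocking .get() would hang forever
    except FutTimeout:
        return "deadlocked"

result = pool.submit(outer).result()
pool.shutdown(wait=True)
```

The fix is either to guarantee the blocking wait only ever happens on a non-pool thread, or to restructure the wait as a continuation (thenValue) so no pool thread ever blocks.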

}
}

auto read_results = folly::collectAll(read_futures).get();
Contributor

Same deadlock risk as the check_reload change: folly::collectAll(read_futures).get() blocks the calling thread. recurse_index_keys is called from delete_trees_responsibly, which in turn is called in some paths from within future chains (e.g., delete_unreferenced_pruned_indexes at line 158 of local_versioned_engine.cpp is itself a .thenValue callback). If recurse_index_keys is reached from inside a thread pool thread, this .get() can deadlock when the pool is saturated.

Verify the full call graph to confirm this .get() is never reached from within a pool thread, or add a comment explaining why it is safe.

@pytest.fixture(
params=[True, False],
)
def check_single_threaded(request):
Contributor

The fixture name check_single_threaded reads as though it checks (asserts) something about single-threadedness, but it actually configures the thread count. A name like single_threaded_config or maybe_single_threaded would better express its intent. The yield request.param also exposes the boolean param to tests that don't use it; tests that only need the configuration side-effect and don't use the yielded value may find this confusing.


yield request.param

if request.param:
Contributor

If the test body raises an exception, the teardown block (if request.param:) still runs. However, if reinit_task_scheduler() in the setup (line 144) itself throws, the yield is never reached, meaning teardown is also skipped — the config remains set at 1 thread for the rest of the test session. Consider using a try/finally pattern or a separate autouse fixture to ensure cleanup always happens.
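A teardown-safe shape for the fixture, sketched as a plain generator with the same structure a pytest yield-fixture would have (CONFIG and reinit_task_scheduler are stand-ins for the real task-scheduler configuration). Wrapping everything after the first state change in try/finally restores the config even if reinit or the test body raises:

```python
CONFIG = {"cpu_threads": 8}  # stand-in for the real task-scheduler config

def reinit_task_scheduler():
    pass  # stand-in; imagine this call may raise

def single_threaded_config(single_threaded):
    # Generator with the same shape as a pytest yield-fixture.
    if not single_threaded:
        yield single_threaded
        return
    previous = CONFIG["cpu_threads"]
    CONFIG["cpu_threads"] = 1
    try:
        reinit_task_scheduler()  # even if this raises, finally still runs
        yield single_threaded
    finally:
        CONFIG["cpu_threads"] = previous
        reinit_task_scheduler()

# Driving the generator the way pytest would:
gen = single_threaded_config(True)
next(gen)                      # setup
assert CONFIG["cpu_threads"] == 1
gen.close()                    # teardown (raises GeneratorExit at the yield)
```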

if os.path.exists(self.ARCTIC_DIR):
rmtree(self.ARCTIC_DIR, ignore_errors=True)

def time_delete_batch_symbols(self, *args):
Contributor

DeleteBatchVersions.setup prepares delete_requests to delete versions 0 through versions_per_symbol - 3 (i.e., 8 versions for versions_per_symbol=10), leaving version versions_per_symbol - 2 (version 8) and the latest (version 9). But the comment says "all but the latest version" which is misleading — it leaves two versions. This is minor but could confuse future benchmark readers. Also, self.delete_requests deletes version 0 through 7; version 8 (second-to-last) is never deleted, so the benchmark doesn't exercise the "delete all but latest" scenario it claims to test.
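The off-by-one is easy to check directly. With versions_per_symbol = 10, deleting range(versions_per_symbol - 2) removes versions 0 through 7 and leaves two versions, while range(versions_per_symbol - 1) is what "all but the latest" actually requires (the variable names mirror the benchmark setup):

```python
versions_per_symbol = 10
all_versions = set(range(versions_per_symbol))      # versions 0..9

# Current setup: deletes versions 0..7, leaving versions 8 and 9.
deleted_now = set(range(versions_per_symbol - 2))
remaining_now = all_versions - deleted_now

# "All but the latest" needs range(versions_per_symbol - 1): deletes 0..8.
deleted_fixed = set(range(versions_per_symbol - 1))
remaining_fixed = all_versions - deleted_fixed
```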

store(), version_map, index_key, std::move(stream_update_info.previous_index_key_)
}
)
.via(&async::cpu_executor())
Contributor

Adding .via(&async::cpu_executor()) here routes the thenValue callback (which calls delete_unreferenced_pruned_indexes) onto the CPU thread pool. delete_unreferenced_pruned_indexes itself calls delete_trees_responsibly, which now contains folly::collectAll(...).get() calls (both for check_reload and for recurse_index_keys). Blocking inside a CPU pool callback is exactly the deadlock scenario described in guideline §10. This makes the deadlock risk from the two .get() calls more acute — please verify under load that the CPU pool cannot saturate.

@claude
Contributor

claude bot commented Apr 2, 2026

ArcticDB Code Review Summary

PR: #2999 — Improve performance of delete_batch
Reviewer: claude[bot]
Scope: Performance optimization — parallelizing three sequential bottlenecks in delete_trees_responsibly
Last updated: synchronize @ 1e81ffa (sync benchmarks.json code snapshot with fixed modification_functions.py)


API & Compatibility

  • No breaking changes to public Python API
  • On-disk format unchanged
  • Protobuf schema unchanged
  • Key types and ValueType enum unchanged

Memory & Safety

  • RAII used for all resource management
  • No use-after-move patterns (segments are correctly moved into futures)
  • No accidental copies of large objects
  • GIL not relevant (pure C++ path)

Correctness

  • MULTI_KEY handling: the thenValue callback correctly dispatches on index_key.type(): MULTI_KEY calls recurse_segment, TABLE_INDEX uses KeySegment::materialise(). Both paths produce correct key sets.
  • PerKeyResult struct cleanly separates atom_keys and packed_keys; merge loop in the calling thread correctly inserts into res and res_packed.
  • same_stream_id captured by value in the lambda — correct, since the lambda may execute after the calling frame returns.
  • Silent error swallowing fixed: throwUnlessValue() called for all three remove_keys results.
  • Deletion ordering documented.

Async & Threading

  • Deadlock risk addressed: both .get() call sites have analysis comments. folly::collectAll(process_futures).get() still blocks the calling thread, but process_futures are fulfilled by the IO executor — safe because IO and CPU executors are separate.
  • Processing moved into thenValue: key_segment.materialise() and recurse_segment() for each key now run on the IO executor thread that fulfills the read future (Folly inline continuation). CPU work mixed with IO work is acceptable here — the materialisation step is lightweight (creating key objects from a pre-read buffer) and the existing .via(&async::cpu_executor()) on the outer call chain is unaffected.
  • Nested synchronous reads in MULTI_KEY path still present: the recurse_segment → recurse_index_key → store->read_sync() chain remains sequential for nested multi-key chains. Acknowledged limitation, acceptable for now.

Testing

  • single_threaded_config fixture correctly parametrizes tests
  • Teardown wrapped in try/finally
  • Four tests exercise the 1-IO + 1-CPU scenario that would expose deadlocks
  • Python API names corrected: test_batch_delete_symbols calls lib.batch_delete_symbols(); test_batch_write_with_pruning calls lib.batch_write(...). Both use the public Library V2 API correctly.
  • No C++ unit test for parallelized recurse_index_keys (existing test_key_utils.cpp covers the sequential case only)
  • No test for partial-failure error propagation in parallel remove_keys (verifying throwUnlessValue fires on backend error)

Benchmarks

  • New DeleteBatchSymbols and DeleteBatchVersions ASV benchmarks are well-structured
  • teardown correctly cleans up both library and LMDB dir
  • DeleteBatchVersions off-by-one fully resolved: modification_functions.py uses range(versions_per_symbol - 1) and benchmarks.json code snapshot is now in sync (fixed in this commit)

Code Quality

  • std::all_of replaces manual loop: cleaner same-stream-id check, addresses vasil-pashov's nit.
  • Processing inlined into thenValue: eliminates the auxiliary read_keys vector and the two-phase (read-then-process) structure. Directly addresses vasil-pashov's review suggestion.
  • Template parameter renamed from KeyType (shadowed outer enum) to KT in the inner generic lambda — correctness improvement.
  • No duplicated logic
  • No hardcoded credentials or secrets

PR Title & Description

  • Title is clear and accurate
  • Description explains all three bottlenecks and the fixes
  • Benchmark results provided
  • Checklist items in the PR description are all unchecked — worth completing

Summary

This commit is a one-line housekeeping fix: benchmarks.json is updated to mirror the already-corrected modification_functions.py (range(versions_per_symbol - 1) instead of - 2), bringing the benchmark code snapshot in sync with the source. No logic changes.

All previously flagged issues remain resolved. Remaining open items are minor:

  • C++ unit test for parallelized recurse_index_keys
  • Failure-injection test for error propagation through throwUnlessValue
  • Completing the PR description checklist

Comment on lines +192 to +197
for (const auto& index_key : keys) {
same_stream_id = first_stream_id == index_key.id();
if (first_stream_id != index_key.id()) {
same_stream_id = false;
break;
}
}
Collaborator

nit: can use

same_stream_id = std::ranges::adjacent_find(keys, [](const AtomKey& a, const AtomKey& b){ return a.id() != b.id(); }) == keys.end();

// Process results sequentially
ankerl::unordered_dense::set<AtomKey> res;
ankerl::unordered_dense::set<AtomKeyPacked> res_packed;
for (size_t i = 0; i < read_results.size(); ++i) {
Collaborator

Looks like the body of the loop can be placed in a thenValue of the read future. Each future would have to use its own set and then all sets must be merged, but it seems possible. Do you mind trying?
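The restructuring asked for here, where each read future materialises into its own result set and a final merge runs on the calling thread, can be sketched in Python (read_segment and process_segment are hypothetical stand-ins for the storage read and the materialise/recurse step):

```python
from concurrent.futures import ThreadPoolExecutor

def read_segment(key):
    # Stand-in for the async storage read of one index segment.
    return [f"{key}/k{i}" for i in range(2)]

def process_segment(segment):
    # Runs as the continuation of each read (the thenValue body): builds a
    # per-future result set instead of mutating shared state across threads.
    return set(segment)

def recurse(keys, pool):
    # Read and process chained per key; each future owns its own set.
    futs = [pool.submit(lambda k=k: process_segment(read_segment(k))) for k in keys]
    merged = set()
    for f in futs:              # final merge on the calling thread
        merged |= f.result()
    return merged

with ThreadPoolExecutor(max_workers=4) as pool:
    out = recurse(["A", "B"], pool)
```

Since no set is shared between threads, the continuation needs no synchronisation; only the merge loop touches the combined result.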
