
Improve performance of delete_batch #2999

Open
G-D-Petrov wants to merge 11 commits into master from gpetrov/improve_delete_batch_perf

Conversation

Collaborator

@G-D-Petrov G-D-Petrov commented Mar 31, 2026

Reference Issues/PRs

Monday ticket ref: 11637143232

What does this implement or fix?

ArcticDB delete_batch Performance Optimization

Problem

delete_batch was up to 2x slower than a naive parallel multiprocessing approach (one process per symbol deletion).

Root Causes

Three sequential bottlenecks in the delete_trees_responsibly code path:

  1. Sequential check_reload -- version map reload for each symbol was done in a loop, one at a time
  2. Sequential recurse_index_keys -- index segments were read one at a time to discover data keys
  3. Sequential remove_keys -- S3 DeleteObjects calls were batched by key type (3 groups), chained sequentially via .thenValue

Changes

1. Parallel version map reload (local_versioned_engine.cpp)

Replaced the sequential version_map->check_reload() loop with parallel IO tasks, submitting a CheckReloadTask per unique symbol to the IO thread pool.
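The per-symbol fan-out can be sketched in Python with concurrent.futures as an analogy to submitting a CheckReloadTask per unique symbol and waiting on folly::collectAll; check_reload here is a hypothetical stand-in for the real version-map call:

```python
from concurrent.futures import ThreadPoolExecutor

def check_reload(symbol):
    # Stand-in for version_map->check_reload(); returns the reloaded entry.
    return f"entry-for-{symbol}"

def parallel_check_reload(symbols, io_pool):
    # Deduplicate: one reload task per unique symbol, preserving order.
    unique = list(dict.fromkeys(symbols))
    futures = {sym: io_pool.submit(check_reload, sym) for sym in unique}
    # The analogue of folly::collectAll(...).get(): wait for every task.
    return {sym: fut.result() for sym, fut in futures.items()}

with ThreadPoolExecutor(max_workers=4) as pool:
    entries = parallel_check_reload(["a", "b", "a", "c"], pool)
```

The dedup step matters: repeated symbols in the batch should trigger only one reload each.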

2. Parallel index key recursion (key_utils.hpp) - the main change, responsible for the ~2x improvement

Changed recurse_index_keys to read all index segments concurrently, then process the segments sequentially once all reads complete.
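The read-all-concurrently, process-sequentially shape can be illustrated with a small Python sketch (read_index_segment is a hypothetical stand-in for the storage read of one index segment):

```python
from concurrent.futures import ThreadPoolExecutor

def read_index_segment(index_key):
    # Stand-in for reading one index segment to discover its data keys.
    return [f"{index_key}/data-{i}" for i in range(3)]

def recurse_index_keys(index_keys, io_pool):
    # Phase 1: issue all segment reads concurrently.
    read_futures = [io_pool.submit(read_index_segment, k) for k in index_keys]
    results = [f.result() for f in read_futures]  # collectAll(...).get()
    # Phase 2: process segments sequentially, merging into one result set.
    data_keys = set()
    for segment in results:
        data_keys.update(segment)
    return data_keys

with ThreadPoolExecutor(max_workers=8) as pool:
    keys = recurse_index_keys(["idx-A", "idx-B"], pool)
```

Because the merge runs on the calling thread after all reads complete, the result set needs no locking.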

3. Parallel key deletion (local_versioned_engine.cpp)

Replaced the sequential .thenValue chain of 3 remove_keys calls (column_stats, index, data) so that all 3 run concurrently:

  • Each remove_keys submits a RemoveBatchTask to a separate IO thread
  • Each group contains only one key type, so do_remove_impl's groupBy has a single iteration
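A Python analogy for the three concurrent removal groups, with per-group errors re-raised rather than swallowed (remove_keys is a hypothetical stand-in for store->remove_keys):

```python
from concurrent.futures import ThreadPoolExecutor

def remove_keys(keys):
    # Stand-in for store->remove_keys(); one S3 DeleteObjects per key group.
    return len(keys)

def parallel_remove(column_stats, index, data, io_pool):
    # Launch all three groups concurrently instead of chaining them.
    futs = [io_pool.submit(remove_keys, g) for g in (column_stats, index, data)]
    # .result() re-raises any stored exception, so no failure is swallowed
    # (the analogue of calling throwUnlessValue() on each Try<>).
    return [f.result() for f in futs]

with ThreadPoolExecutor(max_workers=3) as pool:
    counts = parallel_remove(["cs1"], ["i1", "i2"], ["d1", "d2", "d3"], pool)
```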

Benchmark Results

delete_batch vs Parallel Multiprocessing

Symbols | delete_batch  | Parallel (N procs) | Winner
50      | 0.96s (52/s)  | 1.09s (46/s)       | delete_batch, 1.14x faster
100     | 1.48s (67/s)  | 2.95s (34/s)       | delete_batch, 1.99x faster
1000    | 3.88s (257/s) | (did not complete) | delete_batch wins

All results are from testing on an EC2 machine against an S3 bucket in the same region, with the number of CPU threads set to (Number of Symbols * Number of CPU Cores).

Any other comments?

After the change, delete_batch has about the same performance as write_batch.

Checklist

Checklist for code changes...
  • Have you updated the relevant docstrings, documentation and copyright notice?
  • Is this contribution tested against all ArcticDB's features?
  • Do all exceptions introduced raise appropriate error messages?
  • Are API changes highlighted in the PR description?
  • Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?

@G-D-Petrov G-D-Petrov added the patch Small change, should increase patch version label Mar 31, 2026
@G-D-Petrov G-D-Petrov marked this pull request as ready for review April 2, 2026 07:05
}
}

auto read_results = folly::collectAll(read_futures).get();
Contributor

The parallel reads in recurse_index_keys only parallelize the top-level index reads. When an input key is a MULTI_KEY, processing its segment at line 225 calls recurse_segment, which in turn calls recurse_index_key synchronously (via store->read_sync) for each nested TABLE_INDEX or MULTI_KEY found inside the multi-key segment. This means nested index reads remain sequential and block the calling thread (which may be the thread waiting on folly::collectAll(...).get()).

For typical delete_batch workloads with only TABLE_INDEX keys this is not a problem, but for snapshots or multi-key chains the speedup is incomplete, and deep nesting could block the thread pool for an extended period. This is worth a comment to set expectations, and potentially a future-work note.

log::version().debug("Data keys deleted.");
return folly::Unit();
});
remove_keys_fut = folly::collectAll(
Contributor

Deleting column-stats keys, index keys, and data keys in parallel removes the previous ordering guarantee. The original sequential chain (column_stats → index → data) was presumably intentional: if a crash or partial failure occurs mid-delete, a reader that finds an index key can still reconstruct the data. With parallel deletion, it is possible for data keys to be removed while the index key still exists, leaving a dangling reference.

ArcticDB's ignores_missing_key_ flag on reads (read_opts) provides some crash-tolerance, but the new parallel ordering is a subtle correctness trade-off worth documenting. At minimum, add a comment here explaining why parallel deletion is safe (or acceptable) in this context.

Also note that folly::collectAll returns all results regardless of individual failures. If one of the three remove_keys calls fails, its error is captured in the corresponding Try<>, but the .thenValue lambda discards them all with auto&&. Errors from individual removal groups will be silently swallowed — the caller will see a successful folly::Unit. The previous sequential chain at least propagated the first failure. Consider checking each result in the .thenValue lambda:

Suggested change:

remove_keys_fut = folly::collectAll(
        store->remove_keys(std::move(vks_column_stats), remove_opts),
        store->remove_keys(std::move(vks_to_delete), remove_opts),
        store->remove_keys(std::move(vks_data_to_delete), remove_opts)
    )
    .via(&async::io_executor())
    .thenValue([](auto&& results) {
        // Re-throw the first error, if any
        std::get<0>(results).throwUnlessValue();
        std::get<1>(results).throwUnlessValue();
        std::get<2>(results).throwUnlessValue();
        return folly::Unit();
    });

async::submit_io_task(CheckReloadTask{store, version_map, min.first, load_strategy})
);
}
auto results = folly::collectAll(reload_futures).get();
Contributor

folly::collectAll(...).get() is called here on the calling thread. If this code is already executing inside a Folly thread pool callback (e.g., an IO or CPU task), calling .get() blocks that thread until all CheckReloadTask futures complete. This is the classic Folly deadlock pattern when the thread pool is saturated: all threads are blocked waiting for tasks that are queued but can never start.

The existing pattern elsewhere in version_map_batch_methods.hpp avoids .get() inside task chains. Please verify this call site is always reached from a non-pool thread (e.g., Python-side call path), or restructure to avoid the blocking .get() if it can be reached from within a pool thread.
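The saturation deadlock described above can be reproduced with a one-thread pool in Python: the inner task is queued but can never start while the outer task blocks on it. This sketch uses result(timeout=...) so the demonstration terminates instead of hanging forever as a real .get() would:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutTimeout

pool = ThreadPoolExecutor(max_workers=1)

def outer():
    # Runs on the pool's only thread, then blocks waiting on work that is
    # queued behind it -- the classic saturated-pool deadlock.
    inner = pool.submit(lambda: 42)
    try:
        return inner.result(timeout=0.5)  # a blocking .get() would hang forever
    except FutTimeout:
        return "deadlocked"

result = pool.submit(outer).result()
pool.shutdown(wait=True)
```

The fix is either to guarantee the blocking wait only ever happens on a non-pool thread, or to restructure the wait as a continuation (thenValue) so no pool thread ever blocks.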

}
}

auto read_results = folly::collectAll(read_futures).get();
Contributor

Same deadlock risk as the check_reload change: folly::collectAll(read_futures).get() blocks the calling thread. recurse_index_keys is called from delete_trees_responsibly, which in turn is called in some paths from within future chains (e.g., delete_unreferenced_pruned_indexes at line 158 of local_versioned_engine.cpp is itself a .thenValue callback). If recurse_index_keys is reached from inside a thread pool thread, this .get() can deadlock when the pool is saturated.

Verify the full call graph to confirm this .get() is never reached from within a pool thread, or add a comment explaining why it is safe.

@pytest.fixture(
params=[True, False],
)
def check_single_threaded(request):
Contributor

The fixture name check_single_threaded reads as though it checks (asserts) something about single-threadedness, but it actually configures the thread count. A name like single_threaded_config or maybe_single_threaded would better express its intent. The yield request.param also exposes the boolean param to tests that don't use it; tests that only need the configuration side-effect and don't use the yielded value may find this confusing.


yield request.param

if request.param:
Contributor

If the test body raises an exception, the teardown block (if request.param:) still runs. However, if reinit_task_scheduler() in the setup (line 144) itself throws, the yield is never reached, meaning teardown is also skipped — the config remains set at 1 thread for the rest of the test session. Consider using a try/finally pattern or a separate autouse fixture to ensure cleanup always happens.
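A teardown-safe shape for the fixture, sketched as a plain generator with the same structure a pytest yield-fixture would have (CONFIG and reinit_task_scheduler are stand-ins for the real task-scheduler configuration). Wrapping everything after the first state change in try/finally restores the config even if reinit or the test body raises:

```python
CONFIG = {"cpu_threads": 8}  # stand-in for the real task-scheduler config

def reinit_task_scheduler():
    pass  # stand-in; imagine this call may raise

def single_threaded_config(single_threaded):
    # Generator with the same shape as a pytest yield-fixture.
    if not single_threaded:
        yield single_threaded
        return
    previous = CONFIG["cpu_threads"]
    CONFIG["cpu_threads"] = 1
    try:
        reinit_task_scheduler()  # even if this raises, finally still runs
        yield single_threaded
    finally:
        CONFIG["cpu_threads"] = previous
        reinit_task_scheduler()

# Driving the generator the way pytest would:
gen = single_threaded_config(True)
next(gen)                      # setup
assert CONFIG["cpu_threads"] == 1
gen.close()                    # teardown (raises GeneratorExit at the yield)
```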

if os.path.exists(self.ARCTIC_DIR):
rmtree(self.ARCTIC_DIR, ignore_errors=True)

def time_delete_batch_symbols(self, *args):
Contributor

DeleteBatchVersions.setup prepares delete_requests to delete versions 0 through versions_per_symbol - 3 (i.e., 8 versions for versions_per_symbol=10), leaving version versions_per_symbol - 2 (version 8) and the latest (version 9). But the comment says "all but the latest version" which is misleading — it leaves two versions. This is minor but could confuse future benchmark readers. Also, self.delete_requests deletes version 0 through 7; version 8 (second-to-last) is never deleted, so the benchmark doesn't exercise the "delete all but latest" scenario it claims to test.
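The off-by-one is easy to check directly. With versions_per_symbol = 10, deleting range(versions_per_symbol - 2) removes versions 0 through 7 and leaves two versions, while range(versions_per_symbol - 1) is what "all but the latest" actually requires (the variable names mirror the benchmark setup):

```python
versions_per_symbol = 10
all_versions = set(range(versions_per_symbol))      # versions 0..9

# Current setup: deletes versions 0..7, leaving versions 8 and 9.
deleted_now = set(range(versions_per_symbol - 2))
remaining_now = all_versions - deleted_now

# "All but the latest" needs range(versions_per_symbol - 1): deletes 0..8.
deleted_fixed = set(range(versions_per_symbol - 1))
remaining_fixed = all_versions - deleted_fixed
```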

store(), version_map, index_key, std::move(stream_update_info.previous_index_key_)
}
)
.via(&async::cpu_executor())
Contributor

Adding .via(&async::cpu_executor()) here routes the thenValue callback (which calls delete_unreferenced_pruned_indexes) onto the CPU thread pool. delete_unreferenced_pruned_indexes itself calls delete_trees_responsibly, which now contains folly::collectAll(...).get() calls (both for check_reload and for recurse_index_keys). Blocking inside a CPU pool callback is exactly the deadlock scenario described in guideline §10. This makes the deadlock risk from the two .get() calls more acute — please verify under load that the CPU pool cannot saturate.

@claude
Contributor

claude bot commented Apr 2, 2026

ArcticDB Code Review Summary

PR: #2999 — Improve performance of delete_batch
Reviewer: claude[bot]
Scope: Performance optimization — parallelizing three sequential bottlenecks in delete_trees_responsibly
Last updated: synchronize @ 1e81ffa (sync benchmarks.json code snapshot with fixed modification_functions.py)


API & Compatibility

  • No breaking changes to public Python API
  • On-disk format unchanged
  • Protobuf schema unchanged
  • Key types and ValueType enum unchanged

Memory & Safety

  • RAII used for all resource management
  • No use-after-move patterns (segments are correctly moved into futures)
  • No accidental copies of large objects
  • GIL not relevant (pure C++ path)

Correctness

  • MULTI_KEY handling: the thenValue callback correctly dispatches on index_key.type(): MULTI_KEY calls recurse_segment, TABLE_INDEX uses KeySegment::materialise(). Both paths produce correct key sets.
  • PerKeyResult struct cleanly separates atom_keys and packed_keys; merge loop in the calling thread correctly inserts into res and res_packed.
  • same_stream_id captured by value in the lambda — correct, since the lambda may execute after the calling frame returns.
  • Silent error swallowing fixed: throwUnlessValue() called for all three remove_keys results.
  • Deletion ordering documented.

Async & Threading

  • Deadlock risk addressed: both .get() call sites have analysis comments. folly::collectAll(process_futures).get() still blocks the calling thread, but process_futures are fulfilled by the IO executor — safe because IO and CPU executors are separate.
  • Processing moved into thenValue: key_segment.materialise() and recurse_segment() for each key now run on the IO executor thread that fulfills the read future (Folly inline continuation). CPU work mixed with IO work is acceptable here — the materialisation step is lightweight (creating key objects from a pre-read buffer) and the existing .via(&async::cpu_executor()) on the outer call chain is unaffected.
  • Nested synchronous reads in MULTI_KEY path still present: the recurse_segment → recurse_index_key → store->read_sync() chain remains sequential for nested multi-key chains. Acknowledged limitation, acceptable for now.

Testing

  • single_threaded_config fixture correctly parametrizes tests
  • Teardown wrapped in try/finally
  • Four tests exercise the 1-IO + 1-CPU scenario that would expose deadlocks
  • Python API names corrected: test_batch_delete_symbols calls lib.batch_delete_symbols(); test_batch_write_with_pruning calls lib.batch_write(...). Both use the public Library V2 API correctly.
  • No C++ unit test for parallelized recurse_index_keys (existing test_key_utils.cpp covers the sequential case only)
  • No test for partial-failure error propagation in parallel remove_keys (verifying throwUnlessValue fires on backend error)

Benchmarks

  • New DeleteBatchSymbols and DeleteBatchVersions ASV benchmarks are well-structured
  • teardown correctly cleans up both library and LMDB dir
  • DeleteBatchVersions off-by-one fully resolved: modification_functions.py uses range(versions_per_symbol - 1) and benchmarks.json code snapshot is now in sync (fixed in this commit)

Code Quality

  • std::all_of replaces manual loop: cleaner same-stream-id check, addresses vasil-pashov's nit.
  • Processing inlined into thenValue: eliminates the auxiliary read_keys vector and the two-phase (read-then-process) structure. Directly addresses vasil-pashov's review suggestion.
  • Template parameter renamed from KeyType (shadowed outer enum) to KT in the inner generic lambda — correctness improvement.
  • No duplicated logic
  • No hardcoded credentials or secrets

PR Title & Description

  • Title is clear and accurate
  • Description explains all three bottlenecks and the fixes
  • Benchmark results provided
  • Checklist items in the PR description are all unchecked — worth completing

Summary

This commit is a one-line housekeeping fix: benchmarks.json is updated to mirror the already-corrected modification_functions.py (range(versions_per_symbol - 1) instead of - 2), bringing the benchmark code snapshot in sync with the source. No logic changes.

All previously flagged issues remain resolved. Remaining open items are minor:

  • C++ unit test for parallelized recurse_index_keys
  • Failure-injection test for error propagation through throwUnlessValue
  • Completing the PR description checklist

Comment on lines +192 to +197
for (const auto& index_key : keys) {
same_stream_id = first_stream_id == index_key.id();
if (first_stream_id != index_key.id()) {
same_stream_id = false;
break;
}
}
Collaborator

nit: can use

same_stream_id = std::ranges::adjacent_find(keys, [](const AtomKey& a, const AtomKey& b){ return a.id() != b.id(); }) == keys.end();

// Process results sequentially
ankerl::unordered_dense::set<AtomKey> res;
ankerl::unordered_dense::set<AtomKeyPacked> res_packed;
for (size_t i = 0; i < read_results.size(); ++i) {
Collaborator

Looks like the body of the loop can be placed in a thenValue of the read future. Each future would have to use its own set and then all sets must be merged, but it seems possible. Do you mind trying?
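The restructuring asked for here, where each read future materialises into its own result set and a final merge runs on the calling thread, can be sketched in Python (read_segment and process_segment are hypothetical stand-ins for the storage read and the materialise/recurse step):

```python
from concurrent.futures import ThreadPoolExecutor

def read_segment(key):
    # Stand-in for the async storage read of one index segment.
    return [f"{key}/k{i}" for i in range(2)]

def process_segment(segment):
    # Runs as the continuation of each read (the thenValue body): builds a
    # per-future result set instead of mutating shared state across threads.
    return set(segment)

def recurse(keys, pool):
    # Read and process chained per key; each future owns its own set.
    futs = [pool.submit(lambda k=k: process_segment(read_segment(k))) for k in keys]
    merged = set()
    for f in futs:              # final merge on the calling thread
        merged |= f.result()
    return merged

with ThreadPoolExecutor(max_workers=4) as pool:
    out = recurse(["A", "B"], pool)
```

Since no set is shared between threads, the continuation needs no synchronisation; only the merge loop touches the combined result.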
