Enhancement/11460886943/dense numeric static schema compaction #2985
alexowens90 wants to merge 7 commits into master
Conversation
a528945 to fd3724d
ArcticDB Code Review Summary

The PR adds a new compact_data_experimental API. The latest commit (a65a94d) adds a first_rows_to_discard cache to complement first_rows_to_keep, eliminating redundant binary searches for discarded rows.

Sections reviewed: API & Compatibility, Memory & Safety, Correctness, Performance, Code Quality, Testing, Build & Dependencies, Security, PR Title & Description.
873e3a8 to c99ab1a
#include <memory>
#include <ranges>
#include <vector>
std::accumulate requires <numeric>, which is not guaranteed to be pulled in transitively across compilers/platforms. This file only includes <memory>, <ranges>, and <vector>, which may happen to provide it on GCC but will likely fail on MSVC or Clang. Add the missing include before <ranges>: #include <numeric>
ArcticDB Code Review Summary

PR: Enhancement: dense numeric static schema compaction

Sections reviewed: API & Compatibility, Memory & Safety, Correctness, Code Quality, Testing, Build & Dependencies, Security, PR Title & Description, Documentation.
@pytest.mark.parametrize("rows_per_segment", [3, 7, 10])
def test_compact_data_idempotent(in_memory_store_factory, rows_per_segment):
def test_compact_data_idempotent(in_memory_store_factory, clear_query_stats, rows_per_segment):
Nit: We can probably remove this test now that we always check for idempotency in every generic_compact_data_test.
Yeh fair, will remove
} else if (!first_rows_to_discard.contains(first_row)) {
    auto begin = new_row_ranges.cbegin();
    auto end = new_row_ranges.cend();
    auto mid = begin + std::distance(begin, end) / 2;
I thought std::set iterators are not random access. They can't really be — advancing them is O(log n) at best, which does not qualify as a random-access operation according to the C++ spec as far as I understand.
So hand-rolling binary search like this is likely O(log n * log n). Can't we use lower_bound?
I'm probably misunderstanding something, as I thought this wouldn't compile.
These are iterators over std::vector<RowRange>, which is created from the original set exactly for this reason.
To use lower_bound we would need to craft a less-than operator which takes into account the slightly odd "equality" condition, which I think would be less easy to follow than the existing implementation.
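For intuition on why copying the std::set into a std::vector matters: a sorted vector gives random-access iterators, so a midpoint-based binary search is a true O(log n). Below is a hedged Python analogue of that lookup — find_containing_range, row_ranges, and the half-open (start, end) tuples are illustrative names, not the PR's actual code:

```python
from bisect import bisect_right

def find_containing_range(row_ranges, row):
    """Binary search over a sorted list of non-overlapping half-open
    (start, end) ranges, analogous to searching a std::vector<RowRange>
    built from the original std::set to get random access."""
    starts = [start for start, _ in row_ranges]
    idx = bisect_right(starts, row) - 1  # last range starting at or before `row`
    if idx >= 0 and row_ranges[idx][0] <= row < row_ranges[idx][1]:
        return row_ranges[idx]
    return None

ranges = [(0, 10), (10, 20), (25, 30)]
print(find_containing_range(ranges, 5))   # (0, 10)
print(find_containing_range(ranges, 22))  # None
```

Running the same probe directly against a std::set would pay the extra iterator-advancement cost the reviewer describes, which is exactly what the vector copy avoids.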
while (remaining_rows > 0) {
    auto rows_to_consume = std::min(remaining_rows, slicing_info.rows_per_segment);
for (size_t idx = 0; idx < slicing_info.num_segments; ++idx) {
    auto rows_to_consume = std::min(slicing_info.rows_in_slice(idx), remaining_rows);
Nit: might be better to:
auto rows_to_consume = slicing_info.rows_in_slice(idx);
util::check(rows_to_consume <= remaining_rows); // With the new remainder tracking this should always be true
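For intuition, here is a hedged Python sketch of what such remainder tracking could look like — rows_in_slices is a hypothetical stand-in for the PR's SlicingInfo::rows_in_slice, not its actual implementation. Spreading the remainder so slice sizes differ by at most one row makes the slices sum exactly to the total, which is why the suggested check should always hold:

```python
def rows_in_slices(total_rows, rows_per_segment):
    """Hypothetical remainder tracking: split total_rows into
    ceil(total_rows / rows_per_segment) slices whose sizes differ by at
    most one row, so consuming slice sizes in order never exceeds the
    remaining row count."""
    num_segments = -(-total_rows // rows_per_segment)  # ceil division
    base, remainder = divmod(total_rows, num_segments)
    # The first `remainder` slices each take one extra row.
    return [base + (idx < remainder) for idx in range(num_segments)]

print(rows_in_slices(20, 7))  # [7, 7, 6]
```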
row_counts = index["end_row"] - index["start_row"]
# Definitions taken from CompactDataClause constructor
min_rows_per_segment = max((2 * rows_per_segment) // 3, 1)
max_rows_per_segment = max((4 * rows_per_segment) // 3, rows_per_segment + 1)
So with this change we're now allowed to have segments with a row count larger than the library setting. I just want to flag that this might become a source of errors if there are places in the code assuming that segment row count <= library-setting row count is an invariant. I find it a bit odd; it's worth pointing out somewhere why we can't cap at the row count in the config.
Nothing should be assuming that, as it is perfectly valid to change the lib config setting after data has been written to the library.
The idea was that this is the "optimal" slicing that the user wants. Therefore, if you have a 100k row slice, and you append a 1 row slice, then having a slice with 100,001 rows is clearly better than having a 50,000 and a 50,001 slice.
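The trade-off in this exchange is easy to check numerically. A small sketch using the bound formulas from the CompactDataClause constructor as quoted in the tests (segment_bounds is an illustrative helper name, not part of the API):

```python
def segment_bounds(rows_per_segment):
    """Acceptable post-compaction segment sizes: roughly
    rows_per_segment ± 33%, per the constructor definitions quoted in
    the tests."""
    min_rows = max((2 * rows_per_segment) // 3, 1)
    max_rows = max((4 * rows_per_segment) // 3, rows_per_segment + 1)
    return min_rows, max_rows

min_rows, max_rows = segment_bounds(100_000)
print(min_rows, max_rows)  # 66666 133333
# Appending 1 row to a 100_000-row slice: the merged 100_001-row slice
# is still within bounds, so keeping it beats splitting into
# 50_000 + 50_001 rows.
print(100_001 <= max_rows)  # True
```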
post_compaction_data_keys = len(index)
new_data_keys = len(index[index["version_id"] > vit_before_compaction.version])
expected_get_count = pre_compaction_data_keys - (post_compaction_data_keys - new_data_keys)
nit: what do you think about naming this compacted_data_keys
received_df = lib.read(sym).data
assert_frame_equal(vit_before_compaction.data, received_df)
index = lib.read_index(sym)
assert len(index) == pre_compaction_data_keys
Can't we go one step further and assert_frame_equal the index before and after?
assert row_counts.max() <= max_rows_per_segment
post_compaction_data_keys = len(index)
new_data_keys = len(index[index["version_id"] > vit_before_compaction.version])
Now the number of new keys is computed from the result, meaning that a bug producing more new keys than needed could still end up with passing tests. Is it easy/possible to come up with a deterministic formula to check the expected number of new keys?
I think to do so you would just end up re-implementing the algorithms in CompactDataClause::structure_for_processing. From a testing perspective though, we don't really care how many new data keys there are, the important invariants under test here are:
assert row_counts.min() >= min_rows_per_segment
assert row_counts.max() <= max_rows_per_segment
This is why new_data_keys is only used to assert that we didn't write any extra data keys back to storage.
expected_col_bc = lib.read(sym, columns=["col_b", "col_c"]).data
generic_compact_data_test(lib, sym)
assert_frame_equal(expected_col_a, lib.read(sym, columns=["col_a"]).data)
assert_frame_equal(expected_col_bc, lib.read(sym, columns=["col_b", "col_c"]).data)
nit: can add reading only the index
The data doesn't have an index?
cpp/arcticdb/processing/clause.cpp
}
// The greedy algorithm in structure_row_ranges can reslice data where all segments have an acceptable number of
// rows, which is not desirable, so short-circuit out if this is the case
if (std::all_of(row_ranges.begin(), row_ranges.end(), [this](const RowRange& row_range) {
nit: can use std::ranges::all_of
}

std::vector<std::optional<Column>> SegmentReslicer::reslice_dense_numeric_static_schema_columns(
    std::vector<std::optional<std::shared_ptr<Column>>>&& columns, const SlicingInfo& slicing_info
I find std::optional<std::shared_ptr<Column>> a bit redundant. Can't we achieve the same thing by making the shared_ptr null?
Yeh, I was trying to communicate intent more clearly, but I think the ugliness now outweighs the benefit.
arg_1 = df_1
getattr(lib, method)(arg_0, arg_1, prune_previous_version=arg)

assert len(lt.find_keys(KeyType.TABLE_INDEX)) == 1 if should_be_pruned else 2
This assertion will always fail when should_be_pruned=False for the compact_data_experimental case.
With the setup above (lib.write(sym, df_0) — 10 rows in a single segment) and rows_per_segment=100_000, compact_data_experimental is always a noop: min_rows_per_segment = max(2*100_000 // 3, 1) = 66_666, and the single 10-row segment is the only row range, so structure_for_processing erases it (it is already in processing_row_ranges and has ≤ max_rows_per_segment rows), leaving an empty ranges_and_keys. The function returns the existing v0 without writing a new version.
As a result:
- should_be_pruned=True: expects 1 key → gets 1 key (passes for the wrong reason)
- should_be_pruned=False: expects 2 keys → gets 1 key → fails
To exercise the prune logic, the test needs data that actually requires compaction, e.g. append a second batch first so there are 2 segments whose combined size triggers a rewrite, or use a rows_per_segment that is smaller than the segment already written (≤ 6 so that max_rows_per_segment < 10).
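The arithmetic behind the noop claim can be sanity-checked in a few lines. The bound formulas below come from the CompactDataClause constructor definitions quoted in the tests; the erase condition is a paraphrase of this comment, not the actual structure_for_processing code:

```python
rows_per_segment = 100_000

# Bounds as defined in the CompactDataClause constructor (per the tests).
min_rows_per_segment = max((2 * rows_per_segment) // 3, 1)
max_rows_per_segment = max((4 * rows_per_segment) // 3, rows_per_segment + 1)
print(min_rows_per_segment, max_rows_per_segment)  # 66666 133333

# lib.write(sym, df_0) produced a single 10-row segment: it is the only
# row range and is within the maximum, so it is erased from
# ranges_and_keys and the compaction call writes nothing.
single_segment_rows = 10
print(single_segment_rows <= max_rows_per_segment)  # True
```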
getattr(lib, method)(arg_0, arg_1, prune_previous_version=arg)

assert len(lt.find_keys(KeyType.TABLE_INDEX)) == 1 if should_be_pruned else 2
assert len(lt.find_keys(KeyType.TABLE_INDEX)) == (1 if should_be_pruned else 2)
The operator-precedence fix is correct, but the noop problem flagged in the earlier comment still applies after this delta.
State at the point of the compaction call:
- lib.write(sym, df_0) — v0: 1 segment, 10 rows
- lib.append(sym, df_1, prune_previous_version=True) — v1: 2 segments, 20 rows total; v0 is pruned → 1 TABLE_INDEX key in store
compact_data_experimental("sym", 100_000, prune_previous_version=arg):
- rows_per_segment = 100_000, so min_rows_per_segment = max(2*100_000 // 3, 1) = 66_666
- Both segments (10 rows each) are well below the minimum, but combined they are also below the minimum → greedy algorithm emits a single range covering all 20 rows
- That single combined range exactly matches the existing set of segments (structure_for_processing returns an empty ranges_and_keys) → noop, no new version written
Result after the call: still 1 TABLE_INDEX key.
When should_be_pruned=False the assertion == (1 if should_be_pruned else 2) expects 2 keys but finds 1 → test fails.
To make the test exercise the prune-on-compaction path you need an input that forces an actual rewrite. The simplest fix is to use a small rows_per_segment so the 10-row segments are above max_rows_per_segment and must be split:
assert len(lt.find_keys(KeyType.TABLE_INDEX)) == (1 if should_be_pruned else 2)
# Use rows_per_segment=1 so that each 10-row segment exceeds max_rows_per_segment
# and compaction must perform a real rewrite, creating a new TABLE_INDEX version.
lib.append(sym, df_1, prune_previous_version=True)
arg_1 = 1  # rows_per_segment
ArcticDB Code Review Summary

PR: Enhancement/11460886943

Sections reviewed: API & Compatibility, Memory & Safety, Correctness, Code Quality, Testing, Build & Dependencies, Security, PR Title & Description, Outstanding Items Requiring Action.
@@ -27,7 +27,7 @@ inline std::vector<T> flatten_vectors(std::vector<std::vector<T>>&& vec_of_vecs)
std::vector<T> res;
res.reserve(res_size);
for (const auto& vec : vec_of_vecs) {
This must be a reference. Otherwise the use of move iterators below will default to copying.
inline arcticdb::HashedValue operator()(const arcticdb::pipelines::ColRange& col_range) const noexcept {
    return folly::hash::hash_combine(col_range.first, col_range.second);
}
};
Now that #include <folly/hash/Hash.h> is included, we can remove these two, as it handles this case.
Reference Issues/PRs
11460886943
What does this implement or fix?
Adds a new compact_data_experimental API for performing compaction of data keys. After compaction, every segment will have rows_per_segment ± 33% rows, with rows_per_segment defaulting to the lib config setting.

Currently limited to dense, numeric data with static schema. Will raise a SchemaException if any of these criteria are not met. These limitations will gradually be removed in future PRs.

Full profiling and optimisation will come later, but a basic smokescreen comparing to the existing defragment_symbol_data method was done to make sure there aren't any clowns in the car (all with default lib config slicing policies): defragment_symbol_data vs compact_data_experimental.

Note that in the last benchmark, defragment_symbol_data removes the column slicing.