
Encode column stats metadata in the segment header rather than column names, and support multi-index#2990

Open
poodlewars wants to merge 4 commits into master from aseaton/column-stats/field-encoding-proper

Conversation

@poodlewars
Collaborator

@poodlewars poodlewars commented Mar 26, 2026

Monday: 11292562756 11292649800

We believe that no one has created column stats yet, as there is currently no benefit to users. #2958 will start to support using them at read time. We want to make sure we get the on-disk column stats format right before we announce column stats to users and they start to serialize them. This PR changes how we save column stats on disk.

It also allows us to create column stats over multi-indexed dataframes, which was not possible before.

Global search in Man to check no one ever used create_column_stats: https://chat-man.slack.com/archives/CKQBVA96D/p1774986019195379

Column stats information is currently written in KeyType::COLUMN_STATS with column names like v1.0_min(col_three). This is not ideal because at read time we need to parse these string column names to understand what each statistic means.
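To illustrate why the legacy format is awkward: a reader has to regex-parse each column name just to learn which statistic it holds. The sketch below is illustrative only; the pattern, struct, and function names are assumptions, not ArcticDB's actual parsing code.

```cpp
#include <regex>
#include <string>

// Hypothetical parse of a legacy column-stats name like "v1.0_min(col_three)".
// The layout (version prefix, stat name, column in parentheses) mirrors the
// format described above; everything else here is illustrative.
struct ParsedStatName {
    std::string version;  // e.g. "1.0"
    std::string stat;     // e.g. "min"
    std::string column;   // e.g. "col_three"
};

bool parse_legacy_stat_name(const std::string& name, ParsedStatName& out) {
    static const std::regex pattern{R"(v(\d+\.\d+)_(\w+)\((.+)\))"};
    std::smatch m;
    if (!std::regex_match(name, m, pattern))
        return false;  // unrecognised format: caller must decide whether to skip or raise
    out = {m[1], m[2], m[3]};
    return true;
}
```

Any change to the string layout silently breaks a parser like this, which is exactly the fragility the header-based encoding below removes.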

This PR adds a structure in descriptors.proto:

enum ColumnStatsType {
    // Older clients reading a new enum value written by a newer client will decode to this first
    // element, so make sure they don't mis-interpret it as a statistic they understand.
    // https://protobuf.dev/best-practices/dos-donts/#unspecified-enum
    COLUMN_STATS_UNKNOWN = 0;

    // The version numbers here refer to the format of a given statistic. For example we might start
    // off saving string min and max truncated to 8 bytes in a uint64_t, and later change to saving it truncated
    // to a different length. That would necessitate a COLUMN_STATS_MIN_V2 so that old readers do not
    // misinterpret the new statistics format.
    COLUMN_STATS_MIN_V1 = 1;
    COLUMN_STATS_MAX_V1 = 2;
}

message StatColMapping {
    uint32 stats_seg_offset = 1;  // offset into the fields of the KeyType::COLUMN_STATS StreamDescriptor
    uint32 data_col_offset = 2;  // offset into TimeseriesDescriptor#fields_ (in the Index key)
    ColumnStatsType type = 3;
}

// Stored in the user defined metadata for KeyType::COLUMN_STATS
message ColumnStatsHeader {
    // This version number refers to the format of this header structure.
    // For example if we ever want to stop using StatColMapping and move to a different encoding,
    // we would increment the version number. This helps to avoid older clients mis-interpreting
    // existing fields (like an empty StatColMapping in this example).
    uint32 version = 1;
    repeated StatColMapping stats = 2;
    // end of fields in version 1
}

that we save in the KeyType::COLUMN_STATS header, in the user-defined metadata field.

This lets us understand the contents of column stats keys without needing to parse a particular string format.

We keep meaningful strings as the names of the column stats segment columns for debugging.
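With the header in place, a reader can dispatch on the enum instead of parsing strings. The sketch below uses plain C++ stand-ins for the generated protobuf messages (the real code unpacks a ColumnStatsHeader from the google::protobuf::Any stored in the segment's user-defined metadata); the dispatch logic around them is an assumption, not code from this PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Plain-struct stand-ins for the generated protobuf messages shown above.
enum class ColumnStatsType : uint32_t {
    COLUMN_STATS_UNKNOWN = 0,
    COLUMN_STATS_MIN_V1 = 1,
    COLUMN_STATS_MAX_V1 = 2,
};

struct StatColMapping {
    uint32_t stats_seg_offset;
    uint32_t data_col_offset;
    ColumnStatsType type;
};

struct ColumnStatsHeader {
    uint32_t version;
    std::vector<StatColMapping> stats;
};

// Count the stat columns this client knows how to interpret. An enum value
// written by a newer client decodes to COLUMN_STATS_UNKNOWN, which must not
// be treated as a statistic we understand.
size_t count_known_stats(const ColumnStatsHeader& header) {
    size_t known = 0;
    for (const auto& mapping : header.stats) {
        switch (mapping.type) {
            case ColumnStatsType::COLUMN_STATS_MIN_V1:
            case ColumnStatsType::COLUMN_STATS_MAX_V1:
                ++known;
                break;
            case ColumnStatsType::COLUMN_STATS_UNKNOWN:
                break;  // skip (or raise, depending on the read/maintenance policy)
        }
    }
    return known;
}
```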

Compat Testing

Manual testing of what happens if this wheel sees unknown stats

@poodlewars poodlewars changed the title Aseaton/column stats/field encoding proper Encode column stats metadata in the segment header rather than column names Mar 26, 2026
@poodlewars poodlewars added patch Small change, should increase patch version no-release-notes This PR shouldn't be added to release notes. labels Mar 26, 2026
@poodlewars poodlewars changed the title Encode column stats metadata in the segment header rather than column names Encode column stats metadata in the segment header rather than column names, and support multi-index Mar 31, 2026
@poodlewars poodlewars force-pushed the aseaton/column-stats/field-encoding-proper branch from 55b1a0e to 6b0048c Compare March 31, 2026 19:54
@poodlewars poodlewars marked this pull request as ready for review March 31, 2026 19:55

@poodlewars poodlewars force-pushed the aseaton/column-stats/field-encoding-proper branch from 3700752 to f522b02 Compare April 1, 2026 08:48
seg.concatenate(std::move(finalized));
}
google::protobuf::Any any;
any.PackFrom(merged_header);

Unchecked PackFrom return value: any.PackFrom(merged_header) can return false if serialization fails, yet the return value is silently ignored here. The two other call sites in this PR (MinMaxAggregatorData::finalize and merge_column_stats_segments) both check the return value with util::check(packed, ...). This site should be consistent:

Suggested change
any.PackFrom(merged_header);
bool packed = any.PackFrom(merged_header);
util::check(packed, "Failed to pack merged_header into Any in ColumnStatsGenerationClause#process");
seg.set_metadata(std::move(any));

Alex Seaton added 2 commits April 1, 2026 09:58
@poodlewars poodlewars force-pushed the aseaton/column-stats/field-encoding-proper branch from f522b02 to 1c2bf04 Compare April 1, 2026 09:02
@claude
Contributor

claude bot commented Apr 1, 2026

ArcticDB Code Review Summary (updated)

New inline comments posted on this push:

  • version_core.cpp line 2022: three unchecked protobuf calls in create path merge block (old_metadata->UnpackTo, new_segment.metadata()->UnpackTo, any.PackFrom)
  • version_core.cpp line 2096: unchecked PackFrom in drop path
  • column_name_resolution.cpp line 14: three unused includes (stream/merge.hpp, pipeline/slicing.hpp, storage/store.hpp)

Outstanding items (from previous summary plus new findings):

  1. FAIL Null-safety: old_segment->metadata() dereference (version_core.cpp line 1952)
  2. FAIL Unchecked PackFrom in ColumnStatsGenerationClause::process (clause.cpp line 1718)
  3. FAIL Unchecked protobuf calls in create path merge block (version_core.cpp lines 2009/2011/2022) [NEW]
  4. FAIL Unchecked PackFrom in drop path (version_core.cpp line 2096) [NEW]
  5. WARN Compat test for old-format column stats data still TODO
  6. WARN COLUMN_STATS_UNKNOWN raises on all paths -- read/maintenance distinction unimplemented
  7. WARN Unused includes in column_name_resolution.cpp [NEW]

Passes: no public API changes, protobuf schema backwards-compatible, multi-index support correct, duplicate detection added, error types improved, comprehensive new test coverage, CMakeLists.txt updated, Makefile proxy fix correct.

Comment on lines +156 to +160
// TODO when we use column stats at read time, just ignore any missing.
// For maintenance operations, we should raise.
compatibility::raise<ErrorCode::E_UNRECOGNISED_COLUMN_STATS_VERSION>(
"Encountered unknown column stat. Upgrade your ArcticDB installation."
);

The TODO embedded here flags that the desired behaviour is: ignore on read, raise on maintenance — but that distinction is not yet implemented. This constructor is currently called from both read-time and maintenance code paths, so upgrading an older library to read column stats written by a newer version will now throw E_UNRECOGNISED_COLUMN_STATS_VERSION rather than silently skipping the unknown stat.

This is a forward-compatibility regression compared to the previous warn + continue approach. Before this lands, the read/maintenance distinction should either be implemented (e.g. pass a policy flag to the constructor) or the TODO should be turned into a tracked follow-up issue so the behavior is not forgotten.
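One way the read/maintenance distinction could be implemented is a policy flag threaded through to this check. The following is a hypothetical sketch of that approach, not code from the PR; the enum, function, and stat names are illustrative.

```cpp
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical policy controlling how unknown column stats are handled.
enum class UnknownStatPolicy {
    Skip,   // read path: silently ignore stats this client doesn't understand
    Raise,  // maintenance path: refuse to rewrite data we can't interpret
};

// Returns the stat's human-readable name if recognised, otherwise applies the policy.
std::optional<std::string> resolve_stat(uint32_t stat_type, UnknownStatPolicy policy) {
    switch (stat_type) {
        case 1: return "min_v1";
        case 2: return "max_v1";
        default:
            if (policy == UnknownStatPolicy::Raise)
                throw std::runtime_error(
                    "Encountered unknown column stat. Upgrade your ArcticDB installation.");
            return std::nullopt;  // Skip: caller drops this stat from the result
    }
}
```

Read-time callers would pass Skip, maintenance operations Raise, so the forward-compatibility behaviour is explicit at each call site.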

// Add new stat columns to the old segment
old_segment->concatenate(std::move(new_segment));
google::protobuf::Any any;
any.PackFrom(old_header);

Three unchecked protobuf return values in this merge block (lines 2009, 2011, 2022):

  • old_metadata->UnpackTo(&old_header) (line 2009) — return value discarded; could fail silently if the stored Any contains a different message type.
  • new_segment.metadata()->UnpackTo(&new_header) (line 2011) — same issue; new_segment was just produced by merge_column_stats_segments which packs a ColumnStatsHeader, but the check is still good practice.
  • any.PackFrom(old_header) (line 2022) — return value discarded; all other PackFrom call sites in this PR use util::check(packed, ...) but this one does not.

The merge_column_stats_segments call site (line 310) and MinMaxAggregatorData::finalize (line 847) both check PackFrom. Every other UnpackTo in this function is checked via util::check(unpacked, ...). These three should be consistent.

store->update(column_stats_key, std::move(segment_in_memory), update_opts).get();
}
google::protobuf::Any any;
any.PackFrom(new_header);
The reason will be displayed to describe this comment to others. Learn more.

Unchecked PackFrom return value: any.PackFrom(new_header) return value is silently discarded. Every other PackFrom call site in this PR checks the result with util::check(packed, ...). This should be consistent:

Suggested change
any.PackFrom(new_header);
google::protobuf::Any any;
bool packed = any.PackFrom(new_header);
util::check(packed, "Failed to pack new_header in drop_column_stats_impl");
segment_in_memory.reset_metadata();


#include <arcticdb/stream/merge.hpp>

#include <arcticdb/pipeline/column_stats.hpp>
The reason will be displayed to describe this comment to others. Learn more.

Unnecessary includes: <arcticdb/stream/merge.hpp>, <arcticdb/pipeline/slicing.hpp>, and <arcticdb/storage/store.hpp> are not used by any code in this file. The implementation only requires StreamDescriptor (via the .hpp), ErrorCode/user_input::raise (pulled in transitively), and boost::regex. These should be removed to keep compile-time dependencies minimal and avoid hidden coupling.

