
Encode column stats metadata in the segment header rather than column names, and support multi-index#2990

Open
poodlewars wants to merge 4 commits into master from aseaton/column-stats/field-encoding-proper

Conversation

@poodlewars
Collaborator

@poodlewars poodlewars commented Mar 26, 2026

Monday: 11292562756 11292649800

We believe that no one has created column stats yet, as there is currently no benefit to users. #2958 will start to support using them at read time. We want to make sure we get the on-disk column stats format right before we announce column stats to users and they start to serialize them. This PR changes how we save column stats on disk.

It also allows us to create column stats over multi-indexed dataframes, which was not possible before.

Global search in Man to check no one ever used create_column_stats: https://chat-man.slack.com/archives/CKQBVA96D/p1774986019195379

Column stats information is currently written in KeyType::COLUMN_STATS with column names like v1.0_min(col_three). This is not ideal because at read time we need to parse these string column names to understand what each statistic means.
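To illustrate why the legacy format is awkward: a reader has to regex-parse each column name just to learn which statistic it holds. The sketch below is illustrative only; the pattern, struct, and function names are assumptions, not ArcticDB's actual parsing code.

```cpp
#include <regex>
#include <string>

// Hypothetical parse of a legacy column-stats name like "v1.0_min(col_three)".
// The layout (version prefix, stat name, column in parentheses) mirrors the
// format described above; everything else here is illustrative.
struct ParsedStatName {
    std::string version;  // e.g. "1.0"
    std::string stat;     // e.g. "min"
    std::string column;   // e.g. "col_three"
};

bool parse_legacy_stat_name(const std::string& name, ParsedStatName& out) {
    static const std::regex pattern{R"(v(\d+\.\d+)_(\w+)\((.+)\))"};
    std::smatch m;
    if (!std::regex_match(name, m, pattern))
        return false;  // unrecognised format: caller must decide whether to skip or raise
    out = {m[1], m[2], m[3]};
    return true;
}
```

Any change to the string layout silently breaks a parser like this, which is exactly the fragility the header-based encoding below removes.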

This PR adds a structure in descriptors.proto:

enum ColumnStatsType {
    // Older clients reading a new enum value written by a newer client will decode to this first
    // element, so make sure they don't mis-interpret it as a statistic they understand.
    // https://protobuf.dev/best-practices/dos-donts/#unspecified-enum
    COLUMN_STATS_UNKNOWN = 0;

    // The version numbers here refer to the format of a given statistic. For example we might start
    // off saving string min and max truncated to 8 bytes in a uint64_t, and later change to saving it truncated
    // to a different length. That would necessitate a COLUMN_STATS_MIN_V2 so that old readers do not
    // misinterpret the new statistics format.
    COLUMN_STATS_MIN_V1 = 1;
    COLUMN_STATS_MAX_V1 = 2;
}

message StatColMapping {
    uint32 stats_seg_offset = 1;  // offset into the fields of the KeyType::COLUMN_STATS StreamDescriptor
    uint32 data_col_offset = 2;  // offset into TimeseriesDescriptor#fields_ (in the Index key)
    ColumnStatsType type = 3;
}

// Stored in the user defined metadata for KeyType::COLUMN_STATS
message ColumnStatsHeader {
    // This version number refers to the format of this header structure.
    // For example if we ever want to stop using StatColMapping and move to a different encoding,
    // we would increment the version number. This helps to avoid older clients mis-interpreting
    // existing fields (like an empty StatColMapping in this example).
    uint32 version = 1;
    repeated StatColMapping stats = 2;
    // end of fields in version 1
}

that we save in the KeyType::COLUMN_STATS header, in the user-defined metadata field.

This lets us understand the contents of column stats keys without needing to parse a particular string format.

We keep meaningful strings as the names of the column stats segment columns for debugging.
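With the header in place, a reader can dispatch on the enum instead of parsing strings. The sketch below uses plain C++ stand-ins for the generated protobuf messages (the real code unpacks a ColumnStatsHeader from the google::protobuf::Any stored in the segment's user-defined metadata); the dispatch logic around them is an assumption, not code from this PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Plain-struct stand-ins for the generated protobuf messages shown above.
enum class ColumnStatsType : uint32_t {
    COLUMN_STATS_UNKNOWN = 0,
    COLUMN_STATS_MIN_V1 = 1,
    COLUMN_STATS_MAX_V1 = 2,
};

struct StatColMapping {
    uint32_t stats_seg_offset;
    uint32_t data_col_offset;
    ColumnStatsType type;
};

struct ColumnStatsHeader {
    uint32_t version;
    std::vector<StatColMapping> stats;
};

// Count the stat columns this client knows how to interpret. An enum value
// written by a newer client decodes to COLUMN_STATS_UNKNOWN, which must not
// be treated as a statistic we understand.
size_t count_known_stats(const ColumnStatsHeader& header) {
    size_t known = 0;
    for (const auto& mapping : header.stats) {
        switch (mapping.type) {
            case ColumnStatsType::COLUMN_STATS_MIN_V1:
            case ColumnStatsType::COLUMN_STATS_MAX_V1:
                ++known;
                break;
            case ColumnStatsType::COLUMN_STATS_UNKNOWN:
                break;  // skip (or raise, depending on the read/maintenance policy)
        }
    }
    return known;
}
```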

Compat Testing

Manual testing of what happens if this wheel sees unknown stats

@poodlewars poodlewars changed the title Aseaton/column stats/field encoding proper Encode column stats metadata in the segment header rather than column names Mar 26, 2026
@poodlewars poodlewars added patch Small change, should increase patch version no-release-notes This PR shouldn't be added to release notes. labels Mar 26, 2026
@poodlewars poodlewars changed the title Encode column stats metadata in the segment header rather than column names Encode column stats metadata in the segment header rather than column names, and support multi-index Mar 31, 2026
@poodlewars poodlewars force-pushed the aseaton/column-stats/field-encoding-proper branch from 55b1a0e to 6b0048c Compare March 31, 2026 19:54
@poodlewars poodlewars marked this pull request as ready for review March 31, 2026 19:55

@poodlewars poodlewars force-pushed the aseaton/column-stats/field-encoding-proper branch from 3700752 to f522b02 Compare April 1, 2026 08:48
seg.concatenate(std::move(finalized));
}
google::protobuf::Any any;
any.PackFrom(merged_header);

Unchecked PackFrom return value: any.PackFrom(merged_header) can return false if serialization fails, yet the return value is silently ignored here. The two other call sites in this PR (MinMaxAggregatorData::finalize and merge_column_stats_segments) both check the return value with util::check(packed, ...). This site should be consistent:

Suggested change
any.PackFrom(merged_header);
bool packed = any.PackFrom(merged_header);
util::check(packed, "Failed to pack merged_header into Any in ColumnStatsGenerationClause#process");
seg.set_metadata(std::move(any));

Alex Seaton added 2 commits April 1, 2026 09:58
@poodlewars poodlewars force-pushed the aseaton/column-stats/field-encoding-proper branch from f522b02 to 1c2bf04 Compare April 1, 2026 09:02
@claude
Contributor

claude bot commented Apr 1, 2026

ArcticDB Code Review Summary (updated)

New inline comments posted on this push:

  • version_core.cpp line 2022: three unchecked protobuf calls in create path merge block (old_metadata->UnpackTo, new_segment.metadata()->UnpackTo, any.PackFrom)
  • version_core.cpp line 2096: unchecked PackFrom in drop path
  • column_name_resolution.cpp line 14: three unused includes (stream/merge.hpp, pipeline/slicing.hpp, storage/store.hpp)

Outstanding items (from previous summary plus new findings):

  1. FAIL Null-safety: old_segment->metadata() dereference (version_core.cpp line 1952)
  2. FAIL Unchecked PackFrom in ColumnStatsGenerationClause::process (clause.cpp line 1718)
  3. FAIL Unchecked protobuf calls in create path merge block (version_core.cpp lines 2009/2011/2022) [NEW]
  4. FAIL Unchecked PackFrom in drop path (version_core.cpp line 2096) [NEW]
  5. WARN Compat test for old-format column stats data still TODO
  6. WARN COLUMN_STATS_UNKNOWN raises on all paths -- read/maintenance distinction unimplemented
  7. WARN Unused includes in column_name_resolution.cpp [NEW]

Passes: no public API changes, protobuf schema backwards-compatible, multi-index support correct, duplicate detection added, error types improved, comprehensive new test coverage, CMakeLists.txt updated, Makefile proxy fix correct.

Comment on lines +156 to +160
// TODO when we use column stats at read time, just ignore any missing.
// For maintenance operations, we should raise.
compatibility::raise<ErrorCode::E_UNRECOGNISED_COLUMN_STATS_VERSION>(
"Encountered unknown column stat. Upgrade your ArcticDB installation."
);

The TODO embedded here flags that the desired behaviour is: ignore on read, raise on maintenance — but that distinction is not yet implemented. This constructor is currently called from both read-time and maintenance code paths, so upgrading an older library to read column stats written by a newer version will now throw E_UNRECOGNISED_COLUMN_STATS_VERSION rather than silently skipping the unknown stat.

This is a forward-compatibility regression compared to the previous warn + continue approach. Before this lands, the read/maintenance distinction should either be implemented (e.g. pass a policy flag to the constructor) or the TODO should be turned into a tracked follow-up issue so the behavior is not forgotten.
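One way the read/maintenance distinction could be implemented is a policy flag threaded through to this check. The following is a hypothetical sketch of that approach, not code from the PR; the enum, function, and stat names are illustrative.

```cpp
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical policy controlling how unknown column stats are handled.
enum class UnknownStatPolicy {
    Skip,   // read path: silently ignore stats this client doesn't understand
    Raise,  // maintenance path: refuse to rewrite data we can't interpret
};

// Returns the stat's human-readable name if recognised, otherwise applies the policy.
std::optional<std::string> resolve_stat(uint32_t stat_type, UnknownStatPolicy policy) {
    switch (stat_type) {
        case 1: return "min_v1";
        case 2: return "max_v1";
        default:
            if (policy == UnknownStatPolicy::Raise)
                throw std::runtime_error(
                    "Encountered unknown column stat. Upgrade your ArcticDB installation.");
            return std::nullopt;  // Skip: caller drops this stat from the result
    }
}
```

Read-time callers would pass Skip, maintenance operations Raise, so the forward-compatibility behaviour is explicit at each call site.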

// Add new stat columns to the old segment
old_segment->concatenate(std::move(new_segment));
google::protobuf::Any any;
any.PackFrom(old_header);

Three unchecked protobuf return values in this merge block (lines 2009, 2011, 2022):

  • old_metadata->UnpackTo(&old_header) (line 2009) — return value discarded; could fail silently if the stored Any contains a different message type.
  • new_segment.metadata()->UnpackTo(&new_header) (line 2011) — same issue; new_segment was just produced by merge_column_stats_segments which packs a ColumnStatsHeader, but the check is still good practice.
  • any.PackFrom(old_header) (line 2022) — return value discarded; all other PackFrom call sites in this PR use util::check(packed, ...) but this one does not.

The merge_column_stats_segments call site (line 310) and MinMaxAggregatorData::finalize (line 847) both check PackFrom. Every other UnpackTo in this function is checked via util::check(unpacked, ...). These three should be consistent.

store->update(column_stats_key, std::move(segment_in_memory), update_opts).get();
}
google::protobuf::Any any;
any.PackFrom(new_header);
The reason will be displayed to describe this comment to others. Learn more.

Unchecked PackFrom return value: any.PackFrom(new_header) return value is silently discarded. Every other PackFrom call site in this PR checks the result with util::check(packed, ...). This should be consistent:

Suggested change
any.PackFrom(new_header);
google::protobuf::Any any;
bool packed = any.PackFrom(new_header);
util::check(packed, "Failed to pack new_header in drop_column_stats_impl");
segment_in_memory.reset_metadata();


#include <arcticdb/stream/merge.hpp>

#include <arcticdb/pipeline/column_stats.hpp>
The reason will be displayed to describe this comment to others. Learn more.

Unnecessary includes: <arcticdb/stream/merge.hpp>, <arcticdb/pipeline/slicing.hpp>, and <arcticdb/storage/store.hpp> are not used by any code in this file. The implementation only requires StreamDescriptor (via the .hpp), ErrorCode/user_input::raise (pulled in transitively), and boost::regex. These should be removed to keep compile-time dependencies minimal and avoid hidden coupling.

