Encode column stats metadata in the segment header rather than column names, and support multi-index#2990
Encode column stats metadata in the segment header rather than column names, and support multi-index#2990poodlewars wants to merge 4 commits intomasterfrom
Conversation
55b1a0e to
6b0048c
Compare
|
PLACEHOLDER_TO_BE_REPLACED |
|
placeholder - will update with full content |
3700752 to
f522b02
Compare
| seg.concatenate(std::move(finalized)); | ||
| } | ||
| google::protobuf::Any any; | ||
| any.PackFrom(merged_header); |
There was a problem hiding this comment.
Unchecked PackFrom return value: any.PackFrom(merged_header) can return false if serialization fails, yet the return value is silently ignored here. The two other call sites in this PR (MinMaxAggregatorData::finalize and merge_column_stats_segments) both check the return value with util::check(packed, ...). This site should be consistent:
| any.PackFrom(merged_header); | |
| bool packed = any.PackFrom(merged_header); | |
| util::check(packed, "Failed to pack merged_header into Any in ColumnStatsGenerationClause#process"); | |
| seg.set_metadata(std::move(any)); |
…formation in the segment header's user metadata field
f522b02 to
1c2bf04
Compare
ArcticDB Code Review Summary (updated)New inline comments posted on this push:
Outstanding items (from previous summary plus new findings):
Passes: no public API changes, protobuf schema backwards-compatible, multi-index support correct, duplicate detection added, error types improved, comprehensive new test coverage, CMakeLists.txt updated, Makefile proxy fix correct. |
| // TODO when we use column stats at read time, just ignore any missing. | ||
| // For maintenance operations, we should raise. | ||
| compatibility::raise<ErrorCode::E_UNRECOGNISED_COLUMN_STATS_VERSION>( | ||
| "Encountered unknown column stat. Upgrade your ArcticDB installation." | ||
| ); |
There was a problem hiding this comment.
The TODO embedded here flags that the desired behaviour is: ignore on read, raise on maintenance — but that distinction is not yet implemented. This constructor is currently called from both read-time and maintenance code paths, so upgrading an older library to read column stats written by a newer version will now throw E_UNRECOGNISED_COLUMN_STATS_VERSION rather than silently skipping the unknown stat.
This is a forward-compatibility regression compared to the previous warn + continue approach. Before this lands, the read/maintenance distinction should either be implemented (e.g. pass a policy flag to the constructor) or the TODO should be turned into a tracked follow-up issue so the behavior is not forgotten.
| // Add new stat columns to the old segment | ||
| old_segment->concatenate(std::move(new_segment)); | ||
| google::protobuf::Any any; | ||
| any.PackFrom(old_header); |
There was a problem hiding this comment.
Three unchecked protobuf return values in this merge block (lines 2009, 2011, 2022):
old_metadata->UnpackTo(&old_header)(line 2009) — return value discarded; could fail silently if the storedAnycontains a different message type.new_segment.metadata()->UnpackTo(&new_header)(line 2011) — same issue;new_segmentwas just produced bymerge_column_stats_segmentswhich packs aColumnStatsHeader, but the check is still good practice.any.PackFrom(old_header)(line 2022) — return value discarded; all otherPackFromcall sites in this PR useutil::check(packed, ...)but this one does not.
The merge_column_stats_segments call site (line 310) and MinMaxAggregatorData::finalize (line 847) both check PackFrom. Every other UnpackTo in this function is checked via util::check(unpacked, ...). These three should be consistent.
| store->update(column_stats_key, std::move(segment_in_memory), update_opts).get(); | ||
| } | ||
| google::protobuf::Any any; | ||
| any.PackFrom(new_header); |
There was a problem hiding this comment.
Unchecked PackFrom return value: any.PackFrom(new_header) return value is silently discarded. Every other PackFrom call site in this PR checks the result with util::check(packed, ...). This should be consistent:
| any.PackFrom(new_header); | |
| google::protobuf::Any any; | |
| bool packed = any.PackFrom(new_header); | |
| util::check(packed, "Failed to pack new_header in drop_column_stats_impl"); | |
| segment_in_memory.reset_metadata(); |
|
|
||
| #include <arcticdb/stream/merge.hpp> | ||
|
|
||
| #include <arcticdb/pipeline/column_stats.hpp> |
There was a problem hiding this comment.
Unnecessary includes: <arcticdb/stream/merge.hpp>, <arcticdb/pipeline/slicing.hpp>, and <arcticdb/storage/store.hpp> are not used by any code in this file. The implementation only requires StreamDescriptor (via the .hpp), ErrorCode/user_input::raise (pulled in transitively), and boost::regex. These should be removed to keep compile-time dependencies minimal and avoid hidden coupling.
Monday: 11292562756 11292649800
We believe that no one has created column stats yet as there is no benefit to users. #2958 will start to support using them at read time. We want to make sure that we get the column stats format right on disk before we announce column stats to users and they start to serialize them. This PR is making changes to how we save column stats on disk.
It also allows us to create column stats over multi-indexed dataframes, which was not possible before.
Global search in Man to check no one ever used
create_column_stats: https://chat-man.slack.com/archives/CKQBVA96D/p1774986019195379Column stats information is currently written in
KeyType::COLUMN_STATSwith column names like,v1.0_min(col_three). This is not ideal because at read time we need to parse these string column names to understand what each statistic means.This PR adds a structure in
descriptors.proto:that we save in
KeyType::COLUMN_STATSheader, in the user defined metadata field.This lets us understand the contents of column stats keys without needing to parse a particular string format.
We keep meaningful strings as the names of the column stats segment columns for debugging.
Compat Testing
Manual testing of what happens if this wheel sees unknown stats