-
Notifications
You must be signed in to change notification settings - Fork 2k
Description
Is your feature request related to a problem or challenge?
The overlap-based NDV merge formula (estimate_ndv_with_overlap) introduced in #19957 (and shared with Union from #20846) is not associative: merging (A+B)+C produces a different result than A+(B+C) because the intermediate merge updates min/max/NDV, which "smears" the column statistics before the next pairwise merge.
Example:
A = [0,100], NDV=80
B = [50,150], NDV=60
C = [100,200], NDV=50
(A+B)+C = 135
A+(B+C) = 137
Current usage in try_merge, as addressed in #19957, folds left-to-right over row groups, so the result depends on row group ordering. Practically, row group order within a Parquet file is stable, so results are deterministic for a given file. The difference is typically small (bounded by the uniform-distribution assumption).
Describe the solution you'd like
A multi-way merge that computes the overlap formula over all inputs at once rather than pairwise, using the original (unsmeared) min/max/NDV from each input.
Describe alternatives you've considered
Store HLL sketches per row group/column in Parquet and merge them for exact NDV computation, bypassing the scalar heuristic entirely. Parquet does not natively support this, but apache/arrow-rs#8608 (comment) proposes encoding distinct-count estimates in user-defined key=value metadata, which could serve as a vehicle for HLL sketches. Citing as an alternative for completeness, not currently viable as not implemented yet.
Additional context
- Review comment: feat: Extract NDV (distinct_count) statistics from Parquet metadata #19957 (comment) (@gene-bordegaray)
- Related PR: [Minor] propagate distinct_count as inexact through unions #20846 (Union NDV overlap formula)