Skip to content

Improve NDV merge to be associative across multiple inputs #20966

@asolimando

Description

@asolimando

Is your feature request related to a problem or challenge?

The overlap-based NDV merge formula (estimate_ndv_with_overlap) introduced in #19957 (and shared with Union from #20846) is not associative: merging (A+B)+C produces a different result than A+(B+C) because the intermediate merge updates min/max/NDV, which "smears" the column statistics before the next pairwise merge.

Example:

A = [0,100], NDV=80
B = [50,150], NDV=60
C = [100,200], NDV=50

(A+B)+C = 135
A+(B+C) = 137

Current usage in try_merge, as addressed in #19957, folds left-to-right over row groups, so the result depends on row group ordering. Practically, row group order within a Parquet file is stable, so results are deterministic for a given file. The difference is typically small (bounded by the uniform-distribution assumption).

Describe the solution you'd like

A multi-way merge that computes the overlap formula over all inputs at once rather than pairwise, using the original (unsmeared) min/max/NDV from each input.

Describe alternatives you've considered

Store HLL sketches per row group/column in Parquet and merge them for exact NDV computation, bypassing the scalar heuristic entirely. Parquet does not natively support this, but apache/arrow-rs#8608 (comment) proposes encoding distinct-count estimates in user-defined key=value metadata, which could serve as a vehicle for HLL sketches. Citing as an alternative for completeness, not currently viable as not implemented yet.

Additional context

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions