Improve NDV merge to be associative across multiple inputs

### Is your feature request related to a problem or challenge?

The overlap-based NDV merge formula (`estimate_ndv_with_overlap`) introduced in #19957 (and shared with Union from #20846) is not associative: merging `(A+B)+C` produces a different result than `A+(B+C)` because the intermediate merge updates min/max/NDV, which "smears" the column statistics before the next pairwise merge.

Example:
```
A = [0,100], NDV=80
B = [50,150], NDV=60
C = [100,200], NDV=50

(A+B)+C = 135
A+(B+C) = 137
```

Current usage in `try_merge`, as addressed in #19957, folds left-to-right over row groups, so the result depends on row group ordering. Practically, row group order within a Parquet file is stable, so results are deterministic for a given file. The difference is typically small (bounded by the uniform-distribution assumption).

### Describe the solution you'd like

A multi-way merge that computes the overlap formula over all inputs at once rather than pairwise, using the original (unsmeared) min/max/NDV from each input.

### Describe alternatives you've considered

Store HLL sketches per row group/column in Parquet and merge them for exact NDV computation, bypassing the scalar heuristic entirely. Parquet does not natively support this, but [apache/arrow-rs#8608 (comment)](https://github.com/apache/arrow-rs/issues/8608#issuecomment-4041324952) proposes encoding distinct-count estimates in user-defined key=value metadata, which could serve as a vehicle for HLL sketches. Citing as an alternative for completeness, not currently viable as not implemented yet.


### Additional context

- Review comment: https://github.com/apache/datafusion/pull/19957#discussion_r2936704265 (@gene-bordegaray)
- Related PR: #20846 (Union NDV overlap formula)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve NDV merge to be associative across multiple inputs #20966

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Improve NDV merge to be associative across multiple inputs #20966

Description

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions