
Conversation

@konstantinb (Contributor)

What changes were proposed in this pull request?

HIVE-29332: Use null values for min/max Range values of numeric columns if the corresponding stats values are not set

Why are the changes needed?

Stats could be severely underestimated for some queries. In addition, invalid ranges like [0, -10] were theoretically possible. The following screenshot clearly highlights changes in the EXPLAIN output:
[Screenshot: EXPLAIN output before and after the change (HIVE-2932-explain-before-and-after)]
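The guard the PR describes can be sketched as follows. This is a minimal illustration, not Hive's actual classes: `rangeOrNull` and the `isSet` flags mimic Thrift's `isSet` checks on optional i64 stats fields, returning null ("unknown") instead of fabricating a bound of 0, which is what previously made invalid intervals like [0, -10] possible.

```java
// Hypothetical sketch (illustrative names, not Hive's API): only build a
// range when both bounds were actually serialized in the stats.
final class RangeSketch {
    static final class Range {
        final long min, max;
        Range(long min, long max) { this.min = min; this.max = max; }
    }

    // isSetLow/isSetHigh stand in for Thrift's isSet checks on optional fields.
    static Range rangeOrNull(boolean isSetLow, long low, boolean isSetHigh, long high) {
        if (!isSetLow || !isSetHigh) {
            return null;  // a missing bound means "unknown", not 0
        }
        return new Range(low, high);
    }
}
```

Downstream estimation code can then treat a null range as "no information" rather than clamping estimates against a bogus [0, 0] interval.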

Does this PR introduce any user-facing change?

No

How was this patch tested?

With a query test file, unit tests, and a proprietary Hive implementation.

keys: d_datekey (type: bigint), d_sellingseason (type: string)
null sort order: zz
Statistics: Num rows: 1 Data size: 96 Basic stats: COMPLETE Column stats: COMPLETE   (before)
Statistics: Num rows: 2 Data size: 104 Basic stats: COMPLETE Column stats: COMPLETE  (after)
@konstantinb (Contributor, Author) commented:

This now accurately reflects the 2 values in the "IN" clause; before these changes, "1" was used because the interval [0, 0] has length 1.

Filter Operator
predicate: (d_year) IN (1985, 2004) (type: boolean)
Statistics: Num rows: 1 Data size: 96 Basic stats: COMPLETE Column stats: COMPLETE   (before)
Statistics: Num rows: 2 Data size: 104 Basic stats: COMPLETE Column stats: COMPLETE  (after)
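The estimation idea behind this comment can be sketched as follows. This is a hedged illustration, not Hive's actual estimator: when the column range is unknown (null), the IN-list size is used directly instead of being clamped against a bogus [0, 0] interval of length 1.

```java
// Illustrative sketch: estimate how many IN-list values can match,
// given an optional (possibly unknown) column value range.
final class InEstimate {
    static long estimatedInValues(Long rangeMin, Long rangeMax, int inListSize) {
        if (rangeMin == null || rangeMax == null) {
            return inListSize;                       // unknown range: trust the IN list
        }
        long rangeLength = rangeMax - rangeMin + 1;  // closed-interval length
        return Math.min(inListSize, rangeLength);    // matches cannot exceed the range
    }
}
```

With the old behavior, an unset range surfaced as [0, 0], so `IN (1985, 2004)` was capped at 1 matching value; with a null range the estimate is 2, matching the EXPLAIN output above.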
@konstantinb (Contributor, Author) commented:

This now accurately reflects the 2 values in the "IN" clause; before these changes, "1" was used because the interval [0, 0] has length 1.


@konstantinb konstantinb marked this pull request as ready for review November 24, 2025 22:36
StatObjectConverter.fillColumnStatisticsData(partCol.getType(), data, r == null ? null : r.minValue,
r == null ? null : r.maxValue, r == null ? null : r.minValue, r == null ? null : r.maxValue,
r == null ? null : r.minValue.toString(), r == null ? null : r.maxValue.toString(),
r == null || r.minValue == null ? null : r.minValue.toString(), r == null || r.maxValue == null ? null : r.maxValue.toString(),
@deniskuzZ (Member) commented, Nov 28, 2025:

Why not do the same for long stats?

Could we refactor StatObjectConverter.fillColumnStatisticsData and drop the repetition of Object llow, Object lhigh, Object dlow, Object dhigh, Object declow, Object dechigh? They are populated with exactly the same values anyway. Then we could reuse the method in StatsUtils.

// Populate ColStatistics from LongColumnStatsData, checking isSet for optional i64 fields
private static void populateColStatisticsFromLongStats(
org.apache.hadoop.hive.metastore.api.LongColumnStatsData longStats,
ColStatistics cs, double avgColLen) {
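The reviewer's suggested refactor could look roughly like this. The class, method, and key names below are illustrative assumptions, not Hive's actual API: a single (low, high) pair replaces the repeated llow/lhigh, dlow/dhigh, declow/dechigh parameters, which callers were filling with identical values anyway.

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch of the proposed signature collapse (illustrative names).
final class FillStatsSketch {
    static Map<String, Object> fillColumnStatisticsData(Object low, Object high) {
        Map<String, Object> data = new HashMap<>();
        // One null-safe pass serves long, double, and decimal columns alike,
        // instead of three near-identical parameter triples.
        data.put("lowValue", low);
        data.put("highValue", high);
        data.put("lowValueStr", low == null ? null : low.toString());
        data.put("highValueStr", high == null ? null : high.toString());
        return data;
    }
}
```

This also centralizes the `x == null ? null : x.toString()` guard that the PR currently repeats at every call site.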
@deniskuzZ (Member) commented, Nov 28, 2025:

Maybe make ColStatistics cs the 1st arg, i.e. (cs, longStats, avgColLen)? Note: it would be nicer to reuse a refactored StatObjectConverter.fillColumnStatisticsData.
