
Conversation

@konstantinb (Contributor)

What changes were proposed in this pull request?

HIVE-29332: Use null values for min/max Range values of numeric columns if the corresponding stats values are not set

Why are the changes needed?

Stats could be severely underestimated for some queries. In addition, invalid ranges like [0, -10] were theoretically possible. The following screenshot clearly highlights changes in the EXPLAIN output:
[Screenshot: EXPLAIN output before and after the change (HIVE-2932-explain-before-and-after)]
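The guard the PR describes can be sketched as follows. This is a minimal illustration, not Hive's actual classes: `rangeOrNull` and the `isSet` flags mimic Thrift's `isSet` checks on optional i64 stats fields, returning null ("unknown") instead of fabricating a bound of 0, which is what previously made invalid intervals like [0, -10] possible.

```java
// Hypothetical sketch (illustrative names, not Hive's API): only build a
// range when both bounds were actually serialized in the stats.
final class RangeSketch {
    static final class Range {
        final long min, max;
        Range(long min, long max) { this.min = min; this.max = max; }
    }

    // isSetLow/isSetHigh stand in for Thrift's isSet checks on optional fields.
    static Range rangeOrNull(boolean isSetLow, long low, boolean isSetHigh, long high) {
        if (!isSetLow || !isSetHigh) {
            return null;  // a missing bound means "unknown", not 0
        }
        return new Range(low, high);
    }
}
```

Downstream estimation code can then treat a null range as "no information" rather than clamping estimates against a bogus [0, 0] interval.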

Does this PR introduce any user-facing change?

No

How was this patch tested?

With a query test file, unit tests, and a proprietary Hive implementation.

keys: d_datekey (type: bigint), d_sellingseason (type: string)
null sort order: zz
Statistics: Num rows: 1 Data size: 96 Basic stats: COMPLETE Column stats: COMPLETE   (before)
Statistics: Num rows: 2 Data size: 104 Basic stats: COMPLETE Column stats: COMPLETE  (after)
@konstantinb (Contributor, Author) commented:

This now accurately reflects the 2 values in the "IN" clause; before these changes, "1" was used because the interval [0, 0] has length 1.

Filter Operator
predicate: (d_year) IN (1985, 2004) (type: boolean)
Statistics: Num rows: 1 Data size: 96 Basic stats: COMPLETE Column stats: COMPLETE   (before)
Statistics: Num rows: 2 Data size: 104 Basic stats: COMPLETE Column stats: COMPLETE  (after)
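The estimation idea behind this comment can be sketched as follows. This is a hedged illustration, not Hive's actual estimator: when the column range is unknown (null), the IN-list size is used directly instead of being clamped against a bogus [0, 0] interval of length 1.

```java
// Illustrative sketch: estimate how many IN-list values can match,
// given an optional (possibly unknown) column value range.
final class InEstimate {
    static long estimatedInValues(Long rangeMin, Long rangeMax, int inListSize) {
        if (rangeMin == null || rangeMax == null) {
            return inListSize;                       // unknown range: trust the IN list
        }
        long rangeLength = rangeMax - rangeMin + 1;  // closed-interval length
        return Math.min(inListSize, rangeLength);    // matches cannot exceed the range
    }
}
```

With the old behavior, an unset range surfaced as [0, 0], so `IN (1985, 2004)` was capped at 1 matching value; with a null range the estimate is 2, matching the EXPLAIN output above.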
@konstantinb (Contributor, Author) commented:

This now accurately reflects the 2 values in the "IN" clause; before these changes, "1" was used because the interval [0, 0] has length 1.


@konstantinb konstantinb marked this pull request as ready for review November 24, 2025 22:36
StatObjectConverter.fillColumnStatisticsData(partCol.getType(), data, r == null ? null : r.minValue,
r == null ? null : r.maxValue, r == null ? null : r.minValue, r == null ? null : r.maxValue,
r == null ? null : r.minValue.toString(), r == null ? null : r.maxValue.toString(),
r == null || r.minValue == null ? null : r.minValue.toString(), r == null || r.maxValue == null ? null : r.maxValue.toString(),
@deniskuzZ (Member) commented, Nov 28, 2025:

Why not do the same for long stats?

Could we refactor StatObjectConverter.fillColumnStatisticsData and drop the repetition of Object llow, Object lhigh, Object dlow, Object dhigh, Object declow, Object dechigh? They are populated with exactly the same values anyway. Then we could reuse the method in StatsUtils.

// Populate ColStatistics from LongColumnStatsData, checking isSet for optional i64 fields
private static void populateColStatisticsFromLongStats(
org.apache.hadoop.hive.metastore.api.LongColumnStatsData longStats,
ColStatistics cs, double avgColLen) {
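The reviewer's suggested refactor could look roughly like this. The class, method, and key names below are illustrative assumptions, not Hive's actual API: a single (low, high) pair replaces the repeated llow/lhigh, dlow/dhigh, declow/dechigh parameters, which callers were filling with identical values anyway.

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch of the proposed signature collapse (illustrative names).
final class FillStatsSketch {
    static Map<String, Object> fillColumnStatisticsData(Object low, Object high) {
        Map<String, Object> data = new HashMap<>();
        // One null-safe pass serves long, double, and decimal columns alike,
        // instead of three near-identical parameter triples.
        data.put("lowValue", low);
        data.put("highValue", high);
        data.put("lowValueStr", low == null ? null : low.toString());
        data.put("highValueStr", high == null ? null : high.toString());
        return data;
    }
}
```

This also centralizes the `x == null ? null : x.toString()` guard that the PR currently repeats at every call site.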
@deniskuzZ (Member) commented, Nov 28, 2025:

Maybe make ColStatistics cs the 1st arg, i.e. (cs, longStats, avgColLen)? Note: it would be nicer to reuse a refactored StatObjectConverter.fillColumnStatisticsData.
