-
Notifications
You must be signed in to change notification settings - Fork 4.8k
HIVE-29332: StatsUtils to stop using "0" defaults on numeric columns without optional high/low info #6208
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
…without optional high/low info
| keys: d_datekey (type: bigint), d_sellingseason (type: string) | ||
| null sort order: zz | ||
| Statistics: Num rows: 1 Data size: 96 Basic stats: COMPLETE Column stats: COMPLETE | ||
| Statistics: Num rows: 2 Data size: 104 Basic stats: COMPLETE Column stats: COMPLETE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This now accurately reflects 2 values in the "IN" clause whereas before these changes, "1" was used due to the interval of [0, 0] with the length of 1
| Filter Operator | ||
| predicate: (d_year) IN (1985, 2004) (type: boolean) | ||
| Statistics: Num rows: 1 Data size: 96 Basic stats: COMPLETE Column stats: COMPLETE | ||
| Statistics: Num rows: 2 Data size: 104 Basic stats: COMPLETE Column stats: COMPLETE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This now accurately reflects 2 values in the "IN" clause whereas before these changes, "1" was used due to the interval of [0, 0] with the length of 1
|
| StatObjectConverter.fillColumnStatisticsData(partCol.getType(), data, r == null ? null : r.minValue, | ||
| r == null ? null : r.maxValue, r == null ? null : r.minValue, r == null ? null : r.maxValue, | ||
| r == null ? null : r.minValue.toString(), r == null ? null : r.maxValue.toString(), | ||
| r == null || r.minValue == null ? null : r.minValue.toString(), r == null || r.maxValue == null ? null : r.maxValue.toString(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not do the same for long stats?
Could we refactor StatObjectConverter.fillColumnStatisticsData and drop the repetition: Object llow, Object lhigh, Object dlow, Object dhigh, Object declow, Object dechigh.
Anyways they are populated with exact same values. Then we can reuse method in StatsUtils.
| // Populate ColStatistics from LongColumnStatsData, checking isSet for optional i64 fields | ||
| private static void populateColStatisticsFromLongStats( | ||
| org.apache.hadoop.hive.metastore.api.LongColumnStatsData longStats, | ||
| ColStatistics cs, double avgColLen) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe make ColStatistics cs 1st arg? i.e (cs, longStats, avgColLen). note, would be nicer to reuse refactored StatObjectConverter.fillColumnStatisticsData



What changes were proposed in this pull request?
HIVE-29332: Use null values for min/max Range values of numeric columns if the corresponding stats values are not set
Why are the changes needed?
Stats could be severely underestimated for some queries. In addition, invalid ranges like [0, -10] were theoretically possible. The following screenshot clearly highlights changes in the EXPLAIN output:

Does this PR introduce any user-facing change?
No
How was this patch tested?
with a query test file, with unittesting, and with a proprietary Hive implementation