-
Notifications
You must be signed in to change notification settings - Fork 190
Column statistics extraction with Parquet footer fallback for Delta and Iceberg #760
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Column statistics extraction with Parquet footer fallback for Delta and Iceberg #760
Conversation
|
@dhama-shashank-meesho can you file an issue summarizing the issues you encountered and then fill out the PR template? It will make it easier for other users to search for bugs that they encountered as well and see that you have fixed it. |
| int openParenIndex = typeName.indexOf("("); | ||
| String trimmedTypeName = openParenIndex > 0 ? typeName.substring(0, openParenIndex) : typeName; | ||
| switch (trimmedTypeName) { | ||
| case "short": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you update the unit tests to cover this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah added
| * @param fields the schema fields for which to extract statistics | ||
| * @return FileStats with extracted statistics, or empty stats if reading fails | ||
| */ | ||
| private FileStats readStatsFromParquetFooter( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's also make sure there is a test case to cover this flow. It can be a basic one, the detailed coverage for translating from parquet stats to our internal stats representation will be covered in #748
| "Failed to read stats from Parquet footer for file {}: {}. " | ||
| + "File will be included without column statistics.", | ||
| addFile.path(), | ||
| e.getMessage()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| "Failed to read stats from Parquet footer for file {}: {}. " | |
| + "File will be included without column statistics.", | |
| addFile.path(), | |
| e.getMessage()); | |
| "Failed to read stats from Parquet footer for file {}. " | |
| + "File will be included without column statistics.", | |
| addFile.path(), | |
| e); |
This will log out the full exception stacktrace to provide more details on the failure which makes it easier to debug.
| log.debug("Table has no snapshot, skipping file scan (table is empty)"); | ||
| } | ||
|
|
||
| // Filter out files that already exist in Iceberg |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not required, the applyDiff is only for incremental sync. It assumes that the table is already synced once.
| long recordCount = dataFile.getRecordCount(); | ||
| List<ColumnStat> columnStats; | ||
|
|
||
| // For Parquet files, ALWAYS read from footer to match native Iceberg behavior |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is avoided as it introduces overheads and the source stats are typically from the files themselves. Iceberg writers actually generate these stats while writing instead of reading from the footer to avoid this same overhead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we can achive the native iceberg performance ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand this question in this context.
Yes I have added |
Important Read
What is the purpose of the pull request
This pull request fixes critical issues with column statistics extraction and handling across Delta Lake and Iceberg table formats. The changes ensure accurate and complete statistics are extracted from Parquet files, with proper fallback mechanisms when statistics are missing from metadata checkpoints.
Brief change log
Delta Lake Statistics Extraction
DeltaStatsExtractorto read column statistics directly from Parquet file footers when Delta checkpoint statistics are NULL or emptyIceberg Statistics Extraction
IcebergDataFileUpdatesSyncto always read statistics from Parquet footers for Parquet files, matching native Iceberg behavior for accuracy and completenessParquet Statistics Conversion
ParquetStatsConverterUtilto safely handle type conversions and prevent ClassCastException errorsOther Improvements
DeltaSchemaExtractorfor better schema handlingVerify this pull request
This change added tests and can be verified as follows:
TestIcebergSyncwhich verify statistics extraction and conversion