
Conversation

@dhama-shashank-meesho dhama-shashank-meesho commented Nov 17, 2025

Important Read

  • Please ensure the GitHub issue is mentioned at the beginning of the PR

What is the purpose of the pull request

This pull request fixes critical issues with column statistics extraction and handling across Delta Lake and Iceberg table formats. The changes ensure accurate and complete statistics are extracted from Parquet files, with proper fallback mechanisms when statistics are missing from metadata checkpoints.

Brief change log

Delta Lake Statistics Extraction

  • Fixed missing statistics handling: Added fallback mechanism in DeltaStatsExtractor to read column statistics directly from Parquet file footers when Delta checkpoint statistics are NULL or empty
  • Improved performance documentation: Added comprehensive JavaDoc explaining the performance implications of reading from Parquet footers vs. Delta checkpoints
  • Enhanced error handling: Added graceful error handling when reading Parquet footers fails, ensuring conversion continues without statistics rather than failing entirely
  • Fixed null safety: Added null checks to prevent NullPointerException when statistics have NULL min/max ranges
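The fallback described above can be sketched as follows. This is an illustrative stand-in, not the actual DeltaStatsExtractor API: the class, method, and FooterReader names are invented for the example. The shape of the logic matches the change log: prefer checkpoint statistics when present, fall back to the Parquet footer otherwise, and degrade gracefully to "no statistics" if the footer read fails.

```java
import java.util.Optional;

// Sketch of the checkpoint-then-footer fallback (names are illustrative,
// not the real DeltaStatsExtractor API).
public class StatsFallbackSketch {

    public interface FooterReader {
        String read() throws Exception;
    }

    public static Optional<String> resolveStats(String checkpointStats,
                                                FooterReader footerReader) {
        if (checkpointStats != null && !checkpointStats.isEmpty()) {
            return Optional.of(checkpointStats);     // fast path: checkpoint stats
        }
        try {
            return Optional.of(footerReader.read()); // fallback: Parquet footer
        } catch (Exception e) {
            // Graceful degradation: keep the file, just without column stats.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(resolveStats("{\"minValues\":{}}", () -> "footer").orElse("none"));
        System.out.println(resolveStats(null, () -> "footer").orElse("none"));
        System.out.println(resolveStats(null, () -> { throw new Exception("io"); }).orElse("none"));
    }
}
```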

Iceberg Statistics Extraction

  • Enhanced Parquet footer reading: Modified IcebergDataFileUpdatesSync to always read statistics from Parquet footers for Parquet files, matching native Iceberg behavior for accuracy and completeness
  • Improved statistics aggregation: Added proper aggregation logic for statistics across multiple row groups in Parquet files (min/max aggregation, null count summation)
  • Optimized empty table handling: Added optimization to skip expensive manifest file scans for empty tables (tables with no snapshots)
  • Fixed file size reading: Enhanced file size extraction to read from filesystem when metadata is missing or invalid
  • Improved statistics merging: Added logic to merge existing statistics with Parquet footer statistics, ensuring completeness
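The aggregation rule in the second bullet (min/max aggregation, null count summation) can be sketched like this; RowGroupStat is a stand-in type for the example, not the project's internal representation:

```java
import java.util.List;

// Sketch of combining per-row-group Parquet stats into file-level stats:
// overall min/max across groups, null counts summed.
public class RowGroupStatAggregation {

    public static final class RowGroupStat {
        final long min, max, nullCount;
        public RowGroupStat(long min, long max, long nullCount) {
            this.min = min; this.max = max; this.nullCount = nullCount;
        }
    }

    public static RowGroupStat aggregate(List<RowGroupStat> groups) {
        long min = Long.MAX_VALUE, max = Long.MIN_VALUE, nulls = 0;
        for (RowGroupStat g : groups) {
            min = Math.min(min, g.min); // min across row groups
            max = Math.max(max, g.max); // max across row groups
            nulls += g.nullCount;       // null counts are additive
        }
        return new RowGroupStat(min, max, nulls);
    }

    public static void main(String[] args) {
        RowGroupStat merged = aggregate(List.of(
            new RowGroupStat(5, 40, 2),
            new RowGroupStat(1, 30, 3)));
        System.out.println(merged.min + " " + merged.max + " " + merged.nullCount);
    }
}
```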

Parquet Statistics Conversion

  • Enhanced type safety: Improved ParquetStatsConverterUtil to safely handle type conversions and prevent ClassCastException errors
  • Better decimal handling: Enhanced DECIMAL type conversion with proper precision and scale handling
  • Improved error handling: Added null checks and type validation to prevent runtime exceptions during statistics extraction
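For the DECIMAL bullet, the general conversion rule is worth spelling out: Parquet stores decimal min/max statistics as the unscaled value in big-endian two's-complement bytes, and the logical value is reconstructed by applying the column's scale. The sketch below shows that rule in isolation; it is not the exact ParquetStatsConverterUtil code.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

// Reconstructing a DECIMAL value from Parquet's unscaled byte encoding.
public class DecimalStatConversion {

    public static BigDecimal fromUnscaledBytes(byte[] unscaled, int scale) {
        // BigInteger interprets the bytes as big-endian two's-complement;
        // the scale shifts the decimal point.
        return new BigDecimal(new BigInteger(unscaled), scale);
    }

    public static void main(String[] args) {
        // Unscaled value 12345 with scale 2 represents 123.45
        byte[] bytes = BigInteger.valueOf(12345).toByteArray();
        System.out.println(fromUnscaledBytes(bytes, 2));
    }
}
```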

Other Improvements

  • Improved Delta schema extraction: Minor improvements to DeltaSchemaExtractor for better schema handling
  • Updated .gitignore: Added entries to ignore build artifacts and temporary files
  • Code quality: Improved error messages and logging throughout the statistics extraction pipeline

Verify this pull request

This change adds tests and can be verified as follows:

  • Existing test coverage: The changes are covered by existing tests in TestIcebergSync which verify statistics extraction and conversion
  • Integration tests: The changes align with existing integration tests that verify end-to-end conversion between Delta Lake and Iceberg formats
  • Manual verification: The changes have been manually verified by:
    • Running conversion jobs with Delta Lake tables that have NULL checkpoint statistics
    • Verifying that statistics are correctly extracted from Parquet footers
    • Confirming that Iceberg manifest files contain accurate and complete statistics
    • Testing with empty tables to verify the optimization works correctly
    • Validating that file size is correctly read from filesystem when metadata is missing

@the-other-tim-brown
Contributor

@dhama-shashank-meesho can you file an issue summarizing the issues you encountered and then fill out the PR template? It will make it easier for other users to search for bugs that they encountered as well and see that you have fixed it.

int openParenIndex = typeName.indexOf("(");
String trimmedTypeName = openParenIndex > 0 ? typeName.substring(0, openParenIndex) : typeName;
switch (trimmedTypeName) {
case "short":
Contributor


Can you update the unit tests to cover this case?


Yeah added
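The trimming logic quoted above, together with the kind of unit check the reviewer asked for, can be sketched as follows (the class name and the checked values are illustrative, not the actual test code added in the PR):

```java
// Minimal sketch of the indexOf/substring trimming from the quoted snippet:
// parameterized type names like "decimal(10,2)" are reduced to their base
// name before the switch, while plain names pass through unchanged.
public class TypeNameTrimming {

    public static String trimTypeName(String typeName) {
        int openParenIndex = typeName.indexOf("(");
        return openParenIndex > 0 ? typeName.substring(0, openParenIndex) : typeName;
    }

    public static void main(String[] args) {
        System.out.println(trimTypeName("decimal(10,2)")); // parameterized type
        System.out.println(trimTypeName("short"));         // no parentheses
    }
}
```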

* @param fields the schema fields for which to extract statistics
* @return FileStats with extracted statistics, or empty stats if reading fails
*/
private FileStats readStatsFromParquetFooter(
Contributor


Let's also make sure there is a test case to cover this flow. It can be a basic one, the detailed coverage for translating from parquet stats to our internal stats representation will be covered in #748

Comment on lines +385 to +388
"Failed to read stats from Parquet footer for file {}: {}. "
+ "File will be included without column statistics.",
addFile.path(),
e.getMessage());
Contributor


Suggested change:

-    "Failed to read stats from Parquet footer for file {}: {}. "
-        + "File will be included without column statistics.",
-    addFile.path(),
-    e.getMessage());
+    "Failed to read stats from Parquet footer for file {}. "
+        + "File will be included without column statistics.",
+    addFile.path(),
+    e);

This will log out the full exception stacktrace to provide more details on the failure which makes it easier to debug.
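The distinction behind this suggestion is that SLF4J-style loggers attach the full stack trace when the Throwable itself is passed as the final argument, but not when only e.getMessage() is interpolated into the message. The sketch below demonstrates the same behavior with java.util.logging (used here only because it is in the standard library; the project's actual logger is assumed to follow the SLF4J convention):

```java
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Shows that a log call only carries the stack trace when the Throwable
// itself is passed, not when just its message string is used.
public class LogThrowableDemo {

    public static boolean recordsThrowable(boolean passThrowable) {
        Logger logger = Logger.getLogger("demo-" + passThrowable);
        logger.setUseParentHandlers(false);
        final LogRecord[] captured = new LogRecord[1];
        logger.addHandler(new Handler() {
            @Override public void publish(LogRecord r) { captured[0] = r; }
            @Override public void flush() {}
            @Override public void close() {}
        });
        Exception e = new RuntimeException("footer read failed");
        if (passThrowable) {
            // Throwable as an argument: stack trace is preserved in the record.
            logger.log(Level.WARNING, "Failed to read stats from Parquet footer", e);
        } else {
            // Only the message string: the stack trace is lost.
            logger.log(Level.WARNING, "Failed to read stats: " + e.getMessage());
        }
        return captured[0].getThrown() != null;
    }

    public static void main(String[] args) {
        System.out.println(recordsThrowable(false));
        System.out.println(recordsThrowable(true));
    }
}
```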

log.debug("Table has no snapshot, skipping file scan (table is empty)");
}

// Filter out files that already exist in Iceberg
Contributor


This is not required, the applyDiff is only for incremental sync. It assumes that the table is already synced once.

long recordCount = dataFile.getRecordCount();
List<ColumnStat> columnStats;

// For Parquet files, ALWAYS read from footer to match native Iceberg behavior
Contributor


This is avoided as it introduces overheads and the source stats are typically from the files themselves. Iceberg writers actually generate these stats while writing instead of reading from the footer to avoid this same overhead.


Can we achieve the native Iceberg performance this way?

Contributor


I'm not sure I understand this question in this context.

@dhama-shashank-meesho changed the title from "Fixes" to "Column statistics extraction with Parquet footer fallback for Delta and Iceberg" Nov 25, 2025
@dhama-shashank-meesho
Author

> @dhama-shashank-meesho can you file an issue summarizing the issues you encountered and then fill out the PR template? It will make it easier for other users to search for bugs that they encountered as well and see that you have fixed it.

Yes, I have added it.
