
Generated checkpoint files are unexpectedly large and potentially broken #187

Open
igor-lobanov-maersk opened this issue Jul 19, 2024 · 0 comments

@igor-lobanov-maersk

We use the kafka-delta-ingest daemon to write data from a Kafka topic to a Delta Lake table on an ADLS Gen2 account, and then use Dremio to process the data downstream. About 10 GB of data is saved daily, and the table size is currently approaching 1 TB. Until recently we did not have metadata checkpointing enabled on the table, and this led to queries becoming progressively slower.

First we tried checkpointing the table with Spark, and it worked well. The checkpoint parquet file generated by Spark is around 45 MB, and query performance improved considerably.

Then we enabled checkpointing in kafka-delta-ingest. As expected, we observed new checkpoint files being created every 10th commit. However, for some unexpected reason each checkpoint file is more than 500 MB. More crucially, Dremio seems to have a problem with these checkpoint files and is unable to see most of the data in the table.

The Dremio issue will be addressed elsewhere, but here I am trying to understand what is causing such a size difference. Here are the metadata snapshots:

Spark checkpoint file metadata:

<pyarrow._parquet.FileMetaData object at 0xffffa93dbd30>
  created_by: parquet-mr version 1.13.1 (build db4183109d5b734ec5930d870cdae161e408ddba)
  num_columns: 55
  num_rows: 12316
  num_row_groups: 1
  format_version: 1.0
  serialized_size: 13028

Kafka-delta-ingest checkpoint file metadata:

<pyarrow._parquet.FileMetaData object at 0xffff96be4bd0>
  created_by: parquet-rs version 49.0.0
  num_columns: 187
  num_rows: 12437
  num_row_groups: 1
  format_version: 1.0
  serialized_size: 129335
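
For reference, both snapshots can be reproduced with pyarrow along the lines of the sketch below; the file paths are placeholders, not the actual checkpoint file names.

```python
# Minimal sketch for reproducing the FileMetaData summaries above.
# Paths are placeholders for the actual checkpoint files.
import pyarrow.parquet as pq

spark_checkpoint = "spark-table/_delta_log/00000000000000001234.checkpoint.parquet"
kdi_checkpoint = "kdi-table/_delta_log/00000000000000001234.checkpoint.parquet"

for path in (spark_checkpoint, kdi_checkpoint):
    metadata = pq.ParquetFile(path).metadata
    print(path)
    print(metadata)  # prints created_by, num_columns, num_rows, serialized_size, ...
```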

Comparing the schemas, it appears that kafka-delta-ingest stores the file stats as stats_parsed rather than respecting the default of storing them as a JSON string in stats, which explains why there are many more columns. However, this does not explain the order-of-magnitude difference in file size. If anything, the JSON stats in the Spark-generated checkpoint file should take more space than the parsed stats. A rough sketch of the comparison I did is below.
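
The sketch assumes the usual checkpoint layout with a top-level add struct; paths and the expected output are placeholders based on what I observed, not verified listings.

```python
# Rough sketch of the schema comparison; paths are placeholders.
import pyarrow.parquet as pq

for path in ("spark-checkpoint.parquet", "kdi-checkpoint.parquet"):
    add_type = pq.read_schema(path).field("add").type
    add_fields = [add_type.field(i).name for i in range(add_type.num_fields)]
    print(path, [name for name in add_fields if name.startswith("stats")])
    # Spark:              ['stats']
    # kafka-delta-ingest: includes 'stats_parsed' as well
```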

What could be contributing to such a difference in checkpoint file size, and is there something that can be done to reduce it?
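
For what it's worth, as a diagnostic rather than a fix, a per-column breakdown of compressed sizes in the large checkpoint should show which columns dominate, e.g. whether it is stats_parsed or something else entirely. A rough sketch (path is a placeholder):

```python
# Sum the compressed size of each column chunk across row groups and print
# the ten largest columns. Path is a placeholder for the 500 MB checkpoint.
from collections import defaultdict
import pyarrow.parquet as pq

metadata = pq.ParquetFile("kdi-checkpoint.parquet").metadata
sizes = defaultdict(int)
for rg in range(metadata.num_row_groups):
    row_group = metadata.row_group(rg)
    for col in range(row_group.num_columns):
        chunk = row_group.column(col)
        sizes[chunk.path_in_schema] += chunk.total_compressed_size

for column_path, size in sorted(sizes.items(), key=lambda kv: -kv[1])[:10]:
    print(f"{size / 1e6:8.1f} MB  {column_path}")
```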
