
Generated checkpoint files are unexpectedly large and potentially broken #187

Open
igor-lobanov-maersk opened this issue Jul 19, 2024 · 0 comments

@igor-lobanov-maersk

We use the kafka-delta-ingest daemon to write data from a Kafka topic to a Delta Lake table on an ADLS Gen2 account, and then use Dremio to process the data downstream. About 10 GB of data is saved daily, and the table size is currently approaching 1 TB. Until recently we did not have metadata checkpointing enabled on the table, and this led to queries becoming progressively slower.

First we tried checkpointing the table with Spark, and it worked well. The checkpoint parquet file generated by Spark is around 45 MB, and query performance improved considerably.

Then we enabled checkpointing in kafka-delta-ingest. As expected, we observed new checkpoint files being created every 10th commit. However, for some unexpected reason each checkpoint file is more than 500 MB. More crucially, Dremio seems to have a problem with these checkpoint files and is unable to see most of the data in the table.

The Dremio issue will be addressed elsewhere, but here I am trying to understand what is causing such a size difference. Here are the metadata snapshots:

Spark checkpoint file metadata:

<pyarrow._parquet.FileMetaData object at 0xffffa93dbd30>
  created_by: parquet-mr version 1.13.1 (build db4183109d5b734ec5930d870cdae161e408ddba)
  num_columns: 55
  num_rows: 12316
  num_row_groups: 1
  format_version: 1.0
  serialized_size: 13028

Kafka-delta-ingest checkpoint file metadata:

<pyarrow._parquet.FileMetaData object at 0xffff96be4bd0>
  created_by: parquet-rs version 49.0.0
  num_columns: 187
  num_rows: 12437
  num_row_groups: 1
  format_version: 1.0
  serialized_size: 129335
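
For reference, both snapshots can be reproduced with pyarrow along the lines of the sketch below; the file paths are placeholders, not the actual checkpoint file names.

```python
# Minimal sketch for reproducing the FileMetaData summaries above.
# Paths are placeholders for the actual checkpoint files.
import pyarrow.parquet as pq

spark_checkpoint = "spark-table/_delta_log/00000000000000001234.checkpoint.parquet"
kdi_checkpoint = "kdi-table/_delta_log/00000000000000001234.checkpoint.parquet"

for path in (spark_checkpoint, kdi_checkpoint):
    metadata = pq.ParquetFile(path).metadata
    print(path)
    print(metadata)  # prints created_by, num_columns, num_rows, serialized_size, ...
```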

Comparing the schemas, it appears that kafka-delta-ingest stores the file stats as stats_parsed rather than respecting the default of storing them as a JSON string in stats, which explains why there are many more columns. However, this does not explain the order-of-magnitude difference in file size. If anything, the JSON stats in the Spark-generated checkpoint file should take more space than the parsed stats. A rough sketch of the comparison I did is below.
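
The sketch assumes the usual checkpoint layout with a top-level add struct; paths and the expected output are placeholders based on what I observed, not verified listings.

```python
# Rough sketch of the schema comparison; paths are placeholders.
import pyarrow.parquet as pq

for path in ("spark-checkpoint.parquet", "kdi-checkpoint.parquet"):
    add_type = pq.read_schema(path).field("add").type
    add_fields = [add_type.field(i).name for i in range(add_type.num_fields)]
    print(path, [name for name in add_fields if name.startswith("stats")])
    # Spark:              ['stats']
    # kafka-delta-ingest: includes 'stats_parsed' as well
```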

What could be contributing to such a difference in checkpoint file size, and is there something that can be done to reduce it?
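
For what it's worth, as a diagnostic rather than a fix, a per-column breakdown of compressed sizes in the large checkpoint should show which columns dominate, e.g. whether it is stats_parsed or something else entirely. A rough sketch (path is a placeholder):

```python
# Sum the compressed size of each column chunk across row groups and print
# the ten largest columns. Path is a placeholder for the 500 MB checkpoint.
from collections import defaultdict
import pyarrow.parquet as pq

metadata = pq.ParquetFile("kdi-checkpoint.parquet").metadata
sizes = defaultdict(int)
for rg in range(metadata.num_row_groups):
    row_group = metadata.row_group(rg)
    for col in range(row_group.num_columns):
        chunk = row_group.column(col)
        sizes[chunk.path_in_schema] += chunk.total_compressed_size

for column_path, size in sorted(sizes.items(), key=lambda kv: -kv[1])[:10]:
    print(f"{size / 1e6:8.1f} MB  {column_path}")
```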
