We use the kafka-delta-ingest daemon to write data from a Kafka topic to a Delta Lake table on an ADLS Gen2 account, and then use Dremio to process the data downstream. About 10 GB of data is saved daily, and the table size is currently approaching 1 TB. Until recently we did not have metadata checkpointing enabled on the table, which led to queries becoming progressively slower.
First we tried checkpointing the table with Spark, and it worked well. The checkpoint parquet file generated by Spark is around 45 MB, and query performance improved considerably.
Then we enabled checkpointing in kafka-delta-ingest. As expected, we observed new checkpoint files created every 10th commit. However, for some unexpected reason each checkpoint file is more than 500 MB. More crucially, Dremio seems to have a problem with these checkpoint files and is unable to see most of the data in the table.
The Dremio issue will be addressed elsewhere, but here I am trying to understand what is causing such a size difference. Here are the metadata snapshots:
Spark checkpoint file metadata:
<pyarrow._parquet.FileMetaData object at 0xffffa93dbd30>
created_by: parquet-mr version 1.13.1 (build db4183109d5b734ec5930d870cdae161e408ddba)
num_columns: 55
num_rows: 12316
num_row_groups: 1
format_version: 1.0
serialized_size: 13028
Kafka-delta-ingest checkpoint file metadata:
<pyarrow._parquet.FileMetaData object at 0xffff96be4bd0>
created_by: parquet-rs version 49.0.0
num_columns: 187
num_rows: 12437
num_row_groups: 1
format_version: 1.0
serialized_size: 129335
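For reference, this is roughly how the snapshots above can be produced with pyarrow; the file paths below are placeholders for the actual checkpoint parquet files under _delta_log/:

import pyarrow.parquet as pq

# Placeholder paths for the two checkpoint files being compared.
checkpoints = [
    ("Spark checkpoint", "spark-checkpoint.parquet"),
    ("kafka-delta-ingest checkpoint", "kdi-checkpoint.parquet"),
]

for label, path in checkpoints:
    # read_metadata only reads the parquet footer, so this is cheap
    # even for the 500 MB checkpoint files.
    meta = pq.read_metadata(path)
    print(label)
    print(meta)  # created_by, num_columns, num_rows, num_row_groups, serialized_size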
Comparing schemas, it appears that kafka-delta-ingest writes the file stats as stats_parsed, not respecting the default of storing the stats as a JSON string in stats, and this explains why there are many more columns. However, it does not explain the order-of-magnitude difference in file size. If anything, the JSON stats in the Spark-generated checkpoint file should take more space than the parsed stats.
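As a quick check (the file path and the exact nesting under add.stats_parsed are assumptions on my side), the leaf columns contributed by the parsed stats can be counted like this:

import pyarrow.parquet as pq

# Leaf column paths as dotted names, e.g. "add.stats_parsed.minValues.<col>".
names = pq.read_metadata("kdi-checkpoint.parquet").schema.names
parsed = [n for n in names if n.startswith("add.stats_parsed")]
print(len(names), "leaf columns total,", len(parsed), "under add.stats_parsed")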
What could be contributing to such a difference in checkpoint file size, and is there something that can be done to reduce it?