
Sync parquet to other file formats #592

Draft
wants to merge 5 commits into base: main
Conversation

@sudhar91 sudhar91 commented Dec 7, 2024

Issue: #553

What is the purpose of the pull request

This PR contains the changes to sync Parquet to Delta, Hudi, and Iceberg.

Brief change log

  • Changes to sync from Parquet to Delta, Hudi, and Iceberg
  • Handle incremental sync

Co-authored-by: @sundarshankar89

@@ -27,8 +27,9 @@ public class TableFormat {
public static final String HUDI = "HUDI";
public static final String ICEBERG = "ICEBERG";
public static final String DELTA = "DELTA";
public static final String PARQUET="PARQUET";
Contributor:

Please run mvn spotless:apply to clean up some of the formatting issues in the draft

package org.apache.xtable.parquet;

import java.io.IOException;
import java.util.*;
Contributor:

nitpick: we're avoiding the use of * imports in this repo
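For illustration only, the wildcard could be replaced with the explicit classes the file actually uses, for example (the exact set depends on the final code):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;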

Optional<LocatedFileStatus> latestFile =
fsHelper
.getParquetFiles(hadoopConf, basePath)
.max(Comparator.comparing(FileStatus::getModificationTime));
Contributor:

Is there a way to push down this filter so we don't need to iterate through all files under the base path? Maybe we can even limit the file listing to return files created after the modificationTime?

Author:

I don't think we can push down the filter with this API; even to select files newer than a given modification time, we have to list first and then filter. Do you have any other idea in mind for this?

Contributor:

No, I just wanted to see if it was possible, to help with large tables.
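For reference, a minimal sketch of the list-then-filter approach under discussion, using only the standard Hadoop FileSystem API (the class name, the .parquet suffix check, and the recursive listing are illustrative assumptions, not the PR's code):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class ModifiedSinceListerSketch {

  // Lists every file under basePath, then keeps only parquet files modified after lastSyncTime.
  // The time predicate cannot be pushed into the listing itself, so it runs client-side.
  public static List<LocatedFileStatus> listModifiedSince(
      Configuration conf, String basePath, long lastSyncTime) throws IOException {
    FileSystem fs = new Path(basePath).getFileSystem(conf);
    RemoteIterator<LocatedFileStatus> iterator = fs.listFiles(new Path(basePath), true);
    List<LocatedFileStatus> modifiedFiles = new ArrayList<>();
    while (iterator.hasNext()) {
      LocatedFileStatus status = iterator.next();
      if (status.getPath().getName().endsWith(".parquet")
          && status.getModificationTime() > lastSyncTime) {
        modifiedFiles.add(status);
      }
    }
    return modifiedFiles;
  }
}

Because the filter runs client-side, large tables still pay the full listing cost; only a change-notification source (for example the S3-event idea raised later in this review) would avoid that.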

SchemaBuilder.FieldAssembler<Schema> fieldAssembler =
SchemaBuilder.record(internalSchema.getName()).fields();
for (Schema.Field field : internalSchema.getFields()) {
fieldAssembler = fieldAssembler.name(field.name()).type(field.schema()).noDefault();
Contributor:

Can the internal schema have defaults? Can it also have docs on fields? Those would be dropped with this code.
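To illustrate the point, a hedged sketch of how field docs and default values could be carried over when rebuilding the schema with Avro's SchemaBuilder (assumes Avro 1.9+ for Field.hasDefaultValue()/defaultVal(); this is not the PR's code):

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class SchemaRebuildSketch {

  public static Schema rebuildWithDocsAndDefaults(Schema internalSchema) {
    SchemaBuilder.FieldAssembler<Schema> fieldAssembler =
        SchemaBuilder.record(internalSchema.getName()).fields();
    for (Schema.Field field : internalSchema.getFields()) {
      SchemaBuilder.FieldBuilder<Schema> fieldBuilder = fieldAssembler.name(field.name());
      if (field.doc() != null) {
        fieldBuilder = fieldBuilder.doc(field.doc()); // carry over field documentation
      }
      SchemaBuilder.GenericDefault<Schema> typed = fieldBuilder.type(field.schema());
      fieldAssembler =
          field.hasDefaultValue()
              ? typed.withDefault(field.defaultVal()) // carry over the declared default
              : typed.noDefault();
    }
    return fieldAssembler.endRecord();
  }
}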

@Builder.Default private static final FileSystemHelper fsHelper = FileSystemHelper.getInstance();

@Builder.Default
private static final ParquetMetadataExtractor parquetMetadataExtractor =
Contributor:

Is there an implementation for this class missing?

Author:

I added it initially but am not using it; I will remove it.

InternalSchema schema,
Map<String, List<String>> partitionInfo) {
List<PartitionValue> partitionValues = new ArrayList<>();
java.nio.file.Path base = Paths.get(basePath).normalize();
Contributor:

If you're always going to convert the basePath, you should try to find a way to convert it once in the caller and pass it in.

Author:

Noted. Overall I need to simplify the logic, as it is a bit complicated.
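A minimal illustration of the suggestion: normalize once in the caller and pass the result down (the table location and the getPartitionValues helper referenced in the comment are hypothetical placeholders):

import java.nio.file.Path;
import java.nio.file.Paths;

public class BasePathNormalizationExample {

  public static void main(String[] args) {
    String basePath = "/data/warehouse/my_table"; // hypothetical table location
    Path normalizedBase = Paths.get(basePath).normalize(); // normalize once, in the caller
    // ...then hand normalizedBase to every helper that needs it, e.g.:
    // List<PartitionValue> values = getPartitionValues(normalizedBase, schema, partitionInfo);
    System.out.println("Normalized base path: " + normalizedBase);
  }
}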

latestFile.stream()
.map(
file ->
InternalDataFile.builder()
Contributor:

Can the logic for this conversion move to a common method that can also be called from the getTableChangeForCommit?
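As a sketch of that refactor, the FileStatus-to-InternalDataFile conversion could live in a single shared helper that both the snapshot path and getTableChangeForCommit call. The builder fields and import paths below are assumptions made for illustration, not the PR's actual code:

import java.util.List;

import org.apache.hadoop.fs.FileStatus;

import org.apache.xtable.model.stat.PartitionValue;
import org.apache.xtable.model.storage.InternalDataFile;

public class DataFileConversionSketch {

  // Shared conversion used by both the snapshot and the incremental-sync paths.
  static InternalDataFile toInternalDataFile(
      FileStatus file, List<PartitionValue> partitionValues) {
    return InternalDataFile.builder()
        .physicalPath(file.getPath().toString()) // assumed builder fields, for illustration
        .fileSizeBytes(file.getLen())
        .lastModified(file.getModificationTime())
        .partitionValues(partitionValues)
        .build();
  }
}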

import org.apache.xtable.spi.extractor.ConversionSource;

@Builder
public class ParquetConversionSource implements ConversionSource<Long> {
Contributor:

I think it may be a bit more robust if we use a time interval instead of a single long here. Then you will be able to draw a clear boundary for each run of the conversion source. What are your thoughts?

Author:

Can you explain a bit more about the interval and how you are envisioning it? This long is the last synced modification time of a file, so on the next run we list files with a greater modification time and the new files are synced.

Contributor:

I was thinking of an interval since it can also easily show where the start time was for the sync. This could be useful when the targets fall out of sync with each other. Currently, if there are commits 1, 2, and 3 in the source, and Target1 is only synced to 1 but Target2 is synced to 2, the incremental sync can sync 2 and 3 to Target1 and only 3 to Target2 as part of the same sync. I am not sure what that will look like for this source, so I was thinking intervals could help us define these "commits", but I need to think through it some more.
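For concreteness, one possible shape for such an interval; this value class is purely illustrative and not part of the PR or the XTable API:

public class SyncInterval {

  private final long startTimeMs; // exclusive lower bound: the last synced modification time
  private final long endTimeMs; // inclusive upper bound: the moment this run captured the listing

  public SyncInterval(long startTimeMs, long endTimeMs) {
    this.startTimeMs = startTimeMs;
    this.endTimeMs = endTimeMs;
  }

  // A file belongs to this "commit" if its modification time falls inside the interval.
  public boolean contains(long modificationTimeMs) {
    return modificationTimeMs > startTimeMs && modificationTimeMs <= endTimeMs;
  }

  public long getStartTimeMs() {
    return startTimeMs;
  }

  public long getEndTimeMs() {
    return endTimeMs;
  }
}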

return ParquetConversionSource.builder()
.tableName(sourceTable.getName())
.basePath(sourceTable.getBasePath())
.hadoopConf(new Configuration())
Contributor:

There is an init method called with the Hadoop configuration; you should be able to simply use hadoopConf here.
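A simplified, self-contained illustration of that point: keep the Configuration handed to init(...) and reuse it when constructing the source, instead of calling new Configuration() inline. The class below is a stand-in, not the PR's actual provider:

import org.apache.hadoop.conf.Configuration;

public class ParquetConversionSourceProviderSketch {

  private Configuration hadoopConf;

  // Called once by the framework with the shared Hadoop configuration.
  public void init(Configuration hadoopConf) {
    this.hadoopConf = hadoopConf;
  }

  public Configuration configurationForConversionSource() {
    // Reuse the stored configuration rather than creating a fresh new Configuration().
    return hadoopConf;
  }
}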

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

public class FileSystemHelper {
Contributor:

I think it would be a good idea to define an interface for getting the parquet files for the table and the changes since the last run. Right now this is all being done through file listing, but we should consider the case where someone implements a way to poll the changes through S3 events.
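A hedged sketch of what such an interface could look like, so a plain file-listing implementation and an event-driven one (for example S3 notifications) could be swapped behind it; the names and signatures are illustrative, not part of the PR:

import java.util.stream.Stream;

import org.apache.hadoop.fs.LocatedFileStatus;

public interface ParquetFileSource {

  // All parquet data files that make up the current state of the table.
  Stream<LocatedFileStatus> getCurrentFiles(String basePath);

  // Files added or rewritten since the last sync marker (a modification time in this sketch).
  Stream<LocatedFileStatus> getFilesModifiedSince(String basePath, long lastSyncTimeMs);
}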

@the-other-tim-brown (Contributor):

@sudhar91 this is a great step forward on this feature! My main request is to pull some of these classes that are focused on conversion such as the partition and data file conversion into their own PR with unit tests written. It will be easier and quicker to get those straightforward changes reviewed while the rest of the details are figured out.

Map<String, List<String>> partitionInfo = initPartitionInfo();
for (FileStatus tableStatus : tableChanges) {
internalDataFiles.add(
InternalDataFile.builder()
Contributor:

We also need recordCount, which we should be able to get from the column stats.
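For example, the record count can be read from the parquet footer by summing the row counts of the file's row groups; a minimal sketch using parquet-hadoop's ParquetFileReader (error handling and caching omitted):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class ParquetRecordCountSketch {

  public static long getRecordCount(Configuration conf, Path filePath) throws IOException {
    try (ParquetFileReader reader =
        ParquetFileReader.open(HadoopInputFile.fromPath(filePath, conf))) {
      // Each row group (block) in the footer reports its own row count; the total is their sum.
      // ParquetFileReader#getRecordCount() returns the same total directly.
      return reader.getFooter().getBlocks().stream()
          .mapToLong(BlockMetaData::getRowCount)
          .sum();
    }
  }
}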

@sudhar91 (Author):

> @sudhar91 this is a great step forward on this feature! My main request is to pull some of these classes that are focused on conversion such as the partition and data file conversion into their own PR with unit tests written. It will be easier and quicker to get those straightforward changes reviewed while the rest of the details are figured out.

Thanks for taking the time to review. I couldn't push my unit tests because of some odd jar conflicts in my test setup; I'll fix them next week. I raised this PR because I wanted to validate whether my approach to this feature is right :)
