Conversation
// Perform index lookup in metadataTable
// TODO: document here what this map is keyed by
- Map<String, HoodieRecordGlobalLocation> recordIndex = lazyTableMetadata.get().readRecordIndex(recordKeys);
+ Map<String, HoodieRecordGlobalLocation> recordIndex = HoodieDataUtils.dedupeAndCollectAsMap(lazyTableMetadata.get().readRecordIndexLocationsWithKeys(HoodieListData.eager(recordKeys)));
Is dedup needed here, given that RLI does not have duplicate keys, or can it simply be collected into a map?
No. I used this since it was the method available for import. I will create a new method without the deduplication logic.
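A minimal sketch of what such a method could look like, using plain `java.util.Map.Entry` pairs in place of Hudi's pair data abstraction (names here are illustrative, not Hudi's actual API). `Collectors.toMap` without a merge function throws on a duplicate key, so a violated "RLI has no duplicates" assumption surfaces loudly instead of being silently merged away:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch: collect (recordKey, location) pairs into a map with no
// dedupe step, assuming the record-level index never emits duplicate keys.
public class CollectAsMapSketch {
    public static <K, V> Map<K, V> collectAsMap(List<? extends Map.Entry<K, V>> pairs) {
        // No merge function: a duplicate key throws IllegalStateException,
        // making the no-duplicates assumption explicit.
        return pairs.stream()
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }

    public static void main(String[] args) {
        Map<String, String> index = collectAsMap(List.of(
                new SimpleEntry<>("key1", "partition0/file1"),
                new SimpleEntry<>("key2", "partition1/file7")));
        System.out.println(index.get("key1")); // partition0/file1
    }
}
```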
HoodieTableMetadata tableMetadata = lazyTableMetadata.get();
HoodieTableFileSystemView fileSystemView = getFileSystemView(tableMetadata, metaClient);
import java.util.Map;
import java.util.function.UnaryOperator;

public class TrinoReaderContext
Suggested change:
- public class TrinoReaderContext
+ public class TrinoRecordContext
import java.util.Map;
import java.util.function.UnaryOperator;

public class TrinoReaderContext
Is this class mostly similar to AvroRecordContext?
Could we directly use AvroRecordContext? Once the merging logic is based on Page, we can reimplement a new reader/record context.
if (bufferedRecord.isDelete()) {
    return new HoodieEmptyRecord<>(
            new HoodieKey(bufferedRecord.getRecordKey(), partitionPath),
            HoodieRecord.HoodieRecordType.AVRO);
}
What about the ordering value and payload class handling? Do we have test coverage around updates and deletes with lower and higher ordering values? The current logic can lead to data loss under the EVENT_TIME_ORDERING merge mode: a delete carrying an ordering value of 0 is treated as a commit-time-ordered delete and takes effect regardless of the ordering value, even when the delete's ordering value is lower than the existing record's.
Test cases to cover:
- MOR table v6, base + log files, DefaultHoodiePayload (payload class), timestamp
- MOR table v8, base + log files, EVENT_TIME_ORDERING (merge mode), timestamp
- MOR table v9, base + log files, EVENT_TIME_ORDERING (merge mode), timestamp
- MOR table v6, base + log files, OverwriteWithLatest (payload class)
- MOR table v8, base + log files, COMMIT_TIME_ORDERING (merge mode)
- MOR table v9, base + log files, COMMIT_TIME_ORDERING (merge mode)
Prepare the table in this sequence:
- first batch: inserts (20 keys)
- second batch: updates, with higher ordering values (5 keys), lower ordering values (other 5 keys)
- third batch: deletes, with higher ordering values (3 keys), lower ordering values (other 3 keys)
Use Trino to read the tables and validate the result records.
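The concern behind these test cases can be sketched in isolation. The decision logic below is illustrative (the enum and method names are not Hudi's actual API): under COMMIT_TIME_ORDERING the later write always wins, so a delete always takes effect, while under EVENT_TIME_ORDERING a delete with a lower ordering value than the stored record must not win, which is exactly the case the comment above says can lose data:

```java
// Hypothetical sketch of the delete-vs-stored-record decision under the two
// merge modes discussed in this review thread.
public class OrderingSketch {
    public enum MergeMode { COMMIT_TIME_ORDERING, EVENT_TIME_ORDERING }

    public static boolean shouldDeleteWin(MergeMode mode, long storedOrderingValue, long deleteOrderingValue) {
        if (mode == MergeMode.COMMIT_TIME_ORDERING) {
            // Arrival order decides: the delete arrived later, so it wins.
            return true;
        }
        // EVENT_TIME_ORDERING: compare ordering values; ties go to the later write.
        return deleteOrderingValue >= storedOrderingValue;
    }

    public static void main(String[] args) {
        // A delete with ordering value 5 against a record with ordering value 10:
        System.out.println(shouldDeleteWin(MergeMode.EVENT_TIME_ORDERING, 10L, 5L));  // false
        System.out.println(shouldDeleteWin(MergeMode.COMMIT_TIME_ORDERING, 10L, 5L)); // true
    }
}
```

A bug of the kind described would be equivalent to always returning true here, which the EVENT_TIME_ORDERING lower-ordering-value test cases above would catch.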
        HoodieRecord.HoodieRecordType.AVRO);
}

return new HoodieAvroIndexedRecord(bufferedRecord.getRecord());
Similar here around payload class handling. We should add a test case on a custom payload class.
return null;
}

@Override
For the methods that should not be called (i.e., from the write path), should they throw UnsupportedOperationException?
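A minimal sketch of that suggestion, with an illustrative class and method name (not the actual Trino/Hudi interfaces): a read-only context can make write-path methods fail fast instead of silently returning null, so an unexpected call surfaces immediately:

```java
// Hypothetical sketch: a reader-side context where write-path methods are
// explicitly unsupported rather than returning null.
public class ReadOnlyContextSketch {
    // Illustrative write-path method that the Trino read path never invokes.
    public Object constructEngineRecord(Object schema, Object fieldValues) {
        throw new UnsupportedOperationException(
                "constructEngineRecord is only used on the write path and is not supported by this reader context");
    }
}
```

Failing fast here turns a latent NullPointerException somewhere downstream into an immediate, self-describing error at the call site.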
GenericRecord genericRecord = new GenericData.Record(schema);
for (Schema.Field field : schema.getFields()) {
    genericRecord.put(field.name(), record.get(field.pos()));
}
return genericRecord;
IndexedRecord can be cast to GenericRecord, so there is no need to reconstruct the record, which introduces overhead?
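The shape of that cast-first approach can be sketched without Avro on the classpath. The interfaces below are hypothetical stand-ins for Avro's `IndexedRecord`/`GenericRecord` hierarchy; the point is that when the concrete record already implements the wider interface (as Avro's `GenericData.Record` does), a cast returns the same instance and the field-by-field copy is skipped entirely:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins modeled on Avro's record interfaces.
public class CastSketch {
    public interface IndexedRecord { Object get(int pos); }
    public interface GenericRecord extends IndexedRecord { Object get(String name); }

    // Stand-in for GenericData.Record: implements both access styles.
    public static class ConcreteRecord implements GenericRecord {
        private final Map<String, Object> byName = new LinkedHashMap<>();
        private final List<Object> byPos = new ArrayList<>();
        public void put(String name, Object value) { byName.put(name, value); byPos.add(value); }
        public Object get(int pos) { return byPos.get(pos); }
        public Object get(String name) { return byName.get(name); }
    }

    // Cast when possible; only fall back to a copy for exotic implementations
    // (the copy path is elided in this sketch).
    public static GenericRecord asGeneric(IndexedRecord record) {
        if (record instanceof GenericRecord) {
            return (GenericRecord) record; // zero-cost: no reconstruction
        }
        throw new UnsupportedOperationException("fallback copy elided in this sketch");
    }
}
```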
// TODO: this can rely on colToPos map directly instead of schema
Schema schema = record.getSchema();
IndexedRecord newRecord = new GenericData.Record(schema);
List<Schema.Field> fields = schema.getFields();
for (Schema.Field field : fields) {
    int pos = schema.getField(field.name()).pos();
    newRecord.put(pos, record.get(pos));
}
return newRecord;
Why not returning the record directly?
public UnaryOperator<IndexedRecord> projectRecord(Schema from, Schema to, Map<String, String> renamedColumns)
{
    List<Schema.Field> toFields = to.getFields();
    int[] projection = new int[toFields.size()];
    for (int i = 0; i < projection.length; i++) {
        projection[i] = from.getField(toFields.get(i).name()).pos();
    }

    return fromRecord -> {
        IndexedRecord toRecord = new GenericData.Record(to);
        for (int i = 0; i < projection.length; i++) {
            toRecord.put(i, fromRecord.get(projection[i]));
        }
        return toRecord;
    };
}
Does this support nested fields and renames?
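For the rename half of that question, a rename-aware variant of the same position-mapping idea can be sketched over flat schemas, using field-name lists as stand-ins for Avro Schemas and assuming `renamedColumns` maps a target field name to its name in the source schema (both assumptions, not confirmed by the PR). Nested fields would need recursion into record-typed fields, which a single top-level index array like this does not cover:

```java
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch: position-based projection that consults the rename map
// when resolving each target field's position in the source schema.
public class ProjectionSketch {
    public static UnaryOperator<Object[]> projectRecord(List<String> from, List<String> to,
                                                        Map<String, String> renamedColumns) {
        int[] projection = new int[to.size()];
        for (int i = 0; i < projection.length; i++) {
            // Look the field up under its source name if it was renamed.
            String sourceName = renamedColumns.getOrDefault(to.get(i), to.get(i));
            projection[i] = from.indexOf(sourceName);
        }
        return fromRecord -> {
            Object[] toRecord = new Object[projection.length];
            for (int i = 0; i < projection.length; i++) {
                toRecord[i] = fromRecord[projection[i]];
            }
            return toRecord;
        };
    }
}
```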
if (bufferedRecord.isDelete()) {
    return new HoodieEmptyRecord<>(
            new HoodieKey(bufferedRecord.getRecordKey(), null),
            HoodieRecord.HoodieRecordType.AVRO);
}

return new HoodieAvroIndexedRecord(bufferedRecord.getRecord());
It looks like the logic in TrinoReaderContext was migrated from here, so we should use this opportunity to make sure the implementation is solid.
@Override
public Object getColumnValueAsJava(Schema recordSchema, String column, Properties props)
{
    return null;
This needs to be implemented correctly for column stats to work.
This class is not used; I am going to delete it.
Description
Additional context and related issues
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text: