Skip to content

Comments

upgrade hudi version to 1.1#69

Open
vamsikarnika wants to merge 5 commits intomasterfrom
upgrade_hudi_version_1.1
Open

upgrade hudi version to 1.1#69
vamsikarnika wants to merge 5 commits intomasterfrom
upgrade_hudi_version_1.1

Conversation

@vamsikarnika
Copy link

Description

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

## Section
* Fix some things. ({issue}`issuenumber`)

@vamsikarnika vamsikarnika requested a review from a team as a code owner September 9, 2025 03:35
// Perform index lookup in metadataTable
// TODO: document here what this map is keyed by
Map<String, HoodieRecordGlobalLocation> recordIndex = lazyTableMetadata.get().readRecordIndex(recordKeys);
Map<String, HoodieRecordGlobalLocation> recordIndex = HoodieDataUtils.dedupeAndCollectAsMap(lazyTableMetadata.get().readRecordIndexLocationsWithKeys(HoodieListData.eager(recordKeys)));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is dedup needed here as RLI does not have duplicate keys or it can be simplify collected into a map?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, Have used this since it was method available for import. Will create a new method without duplication logic

Comment on lines 57 to 58
HoodieTableMetadata tableMetadata = lazyTableMetadata.get();
HoodieTableFileSystemView fileSystemView = getFileSystemView(tableMetadata, metaClient);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: unnecessary change

import java.util.Map;
import java.util.function.UnaryOperator;

public class TrinoReaderContext
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public class TrinoReaderContext
public class TrinoRecordContext

import java.util.Map;
import java.util.function.UnaryOperator;

public class TrinoReaderContext
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this class mostly similar to AvroRecordContext?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we directly use AvroRecordContext? Once the merging logic is based on Page we can reimplement a new reader/record context.

Comment on lines 34 to 38
if (bufferedRecord.isDelete()) {
return new HoodieEmptyRecord<>(
new HoodieKey(bufferedRecord.getRecordKey(), partitionPath),
HoodieRecord.HoodieRecordType.AVRO);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the ordering value and payload class handling? Do we have test coverage around updates and deletes with lower and higher ordering values? The current logic can lead to data loss (because the ordering value is 0 indicating it's commit-time-ordered deletes causing the delete to take effect regardless of the ordering value, in the EVENT_TIME_ORDERING merge mode) if the deletes have lower ordering value.

Copy link
Collaborator

@yihua yihua Sep 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test cases to cover:

  1. MOR table v6, base + log files, DefaultHoodiePayload (payload class), timestamp
  2. MOR table v8, base + log files, EVENT_TIME_ORDERING (merge mode), timestamp
  3. MOR table v9, base + log files, EVENT_TIME_ORDERING (merge mode), timestamp
  4. MOR table v6, base + log files, OverwriteWithLatest (payload class)
  5. MOR table v8, base + log files, COMMIT_TIME_ORDERING (merge mode)
  6. MOR table v9, base + log files, COMMIT_TIME_ORDERING (merge mode)

Prepare the table in this sequence:

  1. first batch: inserts (20 keys)
  2. second batch: updates, with higher ordering values (5 keys), lower ordering values (other 5 keys)
  3. third batch: deletes, with higher ordering values (3 keys), lower ordering values (other 3 keys)

Use Trino to read the tables and validate the result records.

HoodieRecord.HoodieRecordType.AVRO);
}

return new HoodieAvroIndexedRecord(bufferedRecord.getRecord());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar here around payload class handling. We should add a test case on custom payload class.

return null;
}

@Override
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the methods that should not be called (i.e., from the write path), should they throw UnsupportedOperationException?

Comment on lines 108 to 112
GenericRecord genericRecord = new GenericData.Record(schema);
for (Schema.Field field : schema.getFields()) {
genericRecord.put(field.name(), record.get(field.pos()));
}
return genericRecord;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IndexedRecord can be casted to GenericRecord, so no need to reconstruct the record which introduces overhead?

Comment on lines 125 to 133
// TODO: this can rely on colToPos map directly instead of schema
Schema schema = record.getSchema();
IndexedRecord newRecord = new GenericData.Record(schema);
List<Schema.Field> fields = schema.getFields();
for (Schema.Field field : fields) {
int pos = schema.getField(field.name()).pos();
newRecord.put(pos, record.get(pos));
}
return newRecord;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not returning the record directly?

Comment on lines 143 to 158
public UnaryOperator<IndexedRecord> projectRecord(Schema from, Schema to, Map<String, String> renamedColumns)
{
List<Schema.Field> toFields = to.getFields();
int[] projection = new int[toFields.size()];
for (int i = 0; i < projection.length; i++) {
projection[i] = from.getField(toFields.get(i).name()).pos();
}

return fromRecord -> {
IndexedRecord toRecord = new GenericData.Record(to);
for (int i = 0; i < projection.length; i++) {
toRecord.put(i, fromRecord.get(projection[i]));
}
return toRecord;
};
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this support nested fields and renames?

Comment on lines -219 to -225
if (bufferedRecord.isDelete()) {
return new HoodieEmptyRecord<>(
new HoodieKey(bufferedRecord.getRecordKey(), null),
HoodieRecord.HoodieRecordType.AVRO);
}

return new HoodieAvroIndexedRecord(bufferedRecord.getRecord());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it looks like the logic in TrinoReaderContext is migrated from here. So we should use this opportunity to make sure the implementation is solid.

cursor[bot]

This comment was marked as outdated.

@Override
public Object getColumnValueAsJava(Schema recordSchema, String column, Properties props)
{
return null;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be implemented correctly for column stats to work

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class is not used, i am gonna delete it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants