Skip to content

Comments

[HUDI-9480] Add Expression Index Support#25

Open
voonhous wants to merge 3 commits intoonehouseinc:masterfrom
voonhous:expression_index
Open

[HUDI-9480] Add Expression Index Support#25
voonhous wants to merge 3 commits intoonehouseinc:masterfrom
voonhous:expression_index

Conversation

@voonhous
Copy link

Description

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

## Section
* Fix some things. ({issue}`issuenumber`)

@github-actions github-actions bot added the hudi label May 30, 2025
@voonhous voonhous force-pushed the expression_index branch from 75effda to f79b19d Compare May 30, 2025 13:10
@voonhous voonhous marked this pull request as draft May 30, 2025 13:11
@voonhous voonhous force-pushed the expression_index branch from 1371d38 to 15d408f Compare June 6, 2025 08:29
@voonhous voonhous marked this pull request as ready for review June 6, 2025 08:30
@voonhous voonhous force-pushed the expression_index branch from 11c470a to 5a58655 Compare June 6, 2025 11:41
@voonhous voonhous closed this Jun 6, 2025
@voonhous voonhous reopened this Jun 6, 2025
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could the user also explicitly disable the expression index if there is an issue or regression? If so, let's revise the config description.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, all configurations that are added can be explicitly disabled.

Comment on lines 128 to 131
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use Preconditions#checkArgument

Comment on lines 158 to 233
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: pull arguments.size() == 1 out of the condition check and put it into Preconditions#checkArgument; similar for other else if branches.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could unit tests be added on this class given this is critical?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any inefficiency of having two levels of expression during predicate evaluation, instead of sth like a Predicates.neq?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be negligible as it will only be evaluated at most once. This is written the way it is as Predicates belongs to Hudi project scope: org.apache.hudi.expression.Predicates and does not have the neq function.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case of "IN"(column, ARRAY[val1, val2...]), does the second argument only contain ARRAY[val1, val2...] and Hudi's Predicates supports it?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or should ARRAY[val1, val2...] be unfolded to adapt to Hudi's Predicates?

Copy link
Author

@voonhous voonhous Jun 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed this in the new commit. However, IN operators are not used for expression index file skipping for now, i.e. not used in io.trino.plugin.hudi.expression.HudiColumnStatsIndexEvaluator

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

String name can be remove as not used.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use getName for matching? A static import of TRINO_FN_NAME can easily make it confusing.

Suggested change
else if (HudiTrinoDayFunctionExpression.TRINO_FN_NAME.equalsIgnoreCase(functionName.getName())) {
else if (HudiTrinoDayFunctionExpression.getName().equalsIgnoreCase(functionName.getName())) {

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is called before initializing an instance, so, a static function is required regardless. Will change TRINO_FN_NAME to getTrinoFnName().

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar for the rest.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this general expression get evaluated without problem?

Copy link
Author

@voonhous voonhous Jun 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is suppose to be a catch-all similar to the default function which will be evaluated, but will not be used downstream.

I am currently trying to get nested types to be evaluated into this branch.

TLDR: There are some issues that will cause this general expression to not be evaluated/reached.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the specific functions for expression index, should they be put into a separate method for handling so it's more clear?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was thinking of this, but not really sure how to do it as initialisation of the concrete implementation is currently a one liner. It is concise and readable and i can't think of any better way to make it more readable.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this collection only contain one or no entry based on the logic right now?

Copy link
Author

@voonhous voonhous Jun 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tryAddAsCandidate will attempt to add candidates. It will contain contain at least one or no entry based on the logic.

The current smoke tests that were added are example/cases where there are more than one entry.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could the ExpressionConverter be a singleton and expose a method to convert the expression, and let the upper layer to handle whether they should be used with expression index?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So basically, all the predicates or expressions are still kept and re-evaluated by Trino, correct?

Copy link
Author

@voonhous voonhous Jun 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, what we have added is outside the scope of what uses, so, it will not affect Trino's optimizations.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the predicate is in the form of day(ts) > 1 and day(ts) < 10, it seems like it is not a candidate for the expression index. Only day(ts) > 1 is. Is that the case, i.e., complex or nested predicate with expression wrapped might not be eligible for expression index pruning?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

From my tests, nested predicates can be handled.

@voonhous voonhous force-pushed the expression_index branch 2 times, most recently from 149e93d to f4ee420 Compare June 20, 2025 13:02
@voonhous
Copy link
Author

Resolved merge conflicts, going through the comments incrementally.

@voonhous voonhous closed this Jun 23, 2025
@voonhous voonhous reopened this Jun 23, 2025
- Fix rebase errors
- Added unit tests for ExpressionConverter
- Fix checkstyle
- Address comments
@voonhous voonhous force-pushed the expression_index branch 2 times, most recently from 81d441d to 769e0a5 Compare June 26, 2025 09:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants