-
Notifications
You must be signed in to change notification settings - Fork 63
feat: optimize truncate(col) == value to startsWith predicate #289
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
a831452 to
5cbeca0
Compare
Rewrite truncate(col, width) == "value" predicates to col STARTS_WITH "value" for string columns when the literal has exactly `width` UTF-8 code points. This enables better predicate pushdown to storage formats and efficient use of prefix indexes. The optimization applies when ALL conditions are met: - Operation is equality - Term is a truncate transform - Literal is a string type - Literal UTF-8 code point count equals truncate width (CRITICAL) - Literal is not null (safety check) - Width is not zero (Transform rejects this, but guard added for safety) Correctly handles UTF-8 multi-byte characters: - "José" = 4 code points, 5 bytes (é = 2 bytes) - "Hi👋" = 3 code points, 6 bytes (👋 = 4 bytes) Implementation details: - Uses type-safe Transform::param() API (no brittle regex parsing) - Binary string comparison semantics (no collation issues) - Benefits from strict metrics evaluation for startsWith - Counts code points (not grapheme clusters) per Iceberg spec - Short-string invariance: when source < width, truncate returns full string, so equality implies exact match and STARTS_WITH remains valid Added Transform::param() public API for accessing transform parameters, replacing brittle ToString() regex parsing. Comprehensive tests including UTF-8 edge cases and empty strings.
5cbeca0 to
ab63064
Compare
|
There are some merge conflicts... |
dfb4504 to
8076d44
Compare
|
Fixed |
| /// This matches the behavior of TruncateUtils::TruncateUTF8. | ||
| /// | ||
| /// NOTE: This counts code points, not grapheme clusters (user-perceived characters). | ||
| /// Per the Iceberg spec, combining marks count as separate code points. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Iceberg spec does not have this. We should not mislead readers.
| /// | ||
| /// \param str The UTF-8 encoded string | ||
| /// \return The number of code points (not bytes, not graphemes) | ||
| inline int32_t CountUTF8CodePoints(std::string_view str) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps this should be moved to string_util.h or truncate_util.h
| auto* transform_term = dynamic_cast<BoundTransform*>(bound_term.get()); | ||
| if (!transform_term) { | ||
| // Should never happen after kind check, but be defensive | ||
| return std::make_shared<BoundLiteralPredicate>(BASE::op(), std::move(bound_term), | ||
| std::move(literal)); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| auto* transform_term = dynamic_cast<BoundTransform*>(bound_term.get()); | |
| if (!transform_term) { | |
| // Should never happen after kind check, but be defensive | |
| return std::make_shared<BoundLiteralPredicate>(BASE::op(), std::move(bound_term), | |
| std::move(literal)); | |
| } | |
| auto* transform_term = internal::checked_cast<BoundTransform*>(bound_term.get()); |
Let's reuse the existing pattern for cast.
| !literal.IsNull()) { // Null safety: skip null literals | ||
|
|
||
| // Extract width parameter using type-safe API | ||
| auto width_opt = transform_term->transform()->param(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can greatly simplify this line and below.
We can add transform_func like below.
class ICEBERG_EXPORT BoundTransform : public BoundTerm {
public:
const std::shared_ptr<TransformFunction>& transform_func() const { return transform_func_; }
};Then we just need to call transform_term->transform_func()->Transform(literal) and check if its return value is same as literal. We don't even need CountUTF8CodePoints above.
| /// For non-parameterized transforms (identity, year, etc.), returns std::nullopt. | ||
| /// | ||
| /// \return The parameter if present, otherwise std::nullopt | ||
| std::optional<int32_t> param() const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove this function? This makes it harder to evolve the variant in the future.
Rewrite truncate(col) == "value" predicates to col STARTS_WITH "value" for string columns. This enables better predicate pushdown to storage formats and efficient use of prefix indexes.
The optimization only applies when:
Added tests to verify correct application and edge cases.