Skip to content

Conversation

@shangxinli
Copy link
Contributor

Rewrite truncate(col) == "value" predicates to col STARTS_WITH "value" for string columns. This enables better predicate pushdown to storage formats and efficient use of prefix indexes.

The optimization only applies when:

  • Operation is equality
  • Term is a truncate transform
  • Literal is a string type

Added tests to verify correct application and edge cases.

@shangxinli shangxinli force-pushed the truncate_literal branch 9 times, most recently from a831452 to 5cbeca0 Compare November 3, 2025 18:51
@shangxinli shangxinli changed the title optimize truncate(col) == value to startsWith predicate feat: optimize truncate(col) == value to startsWith predicate Nov 3, 2025
Rewrite truncate(col, width) == "value" predicates to col STARTS_WITH "value"
for string columns when the literal has exactly `width` UTF-8 code points.
This enables better predicate pushdown to storage formats and efficient use
of prefix indexes.

The optimization applies when ALL conditions are met:
- Operation is equality
- Term is a truncate transform
- Literal is a string type
- Literal UTF-8 code point count equals truncate width (CRITICAL)
- Literal is not null (safety check)
- Width is not zero (Transform rejects this, but guard added for safety)

Correctly handles UTF-8 multi-byte characters:
- "José" = 4 code points, 5 bytes (é = 2 bytes)
- "Hi👋" = 3 code points, 6 bytes (👋 = 4 bytes)

Implementation details:
- Uses type-safe Transform::param() API (no brittle regex parsing)
- Binary string comparison semantics (no collation issues)
- Benefits from strict metrics evaluation for startsWith
- Counts code points (not grapheme clusters) per Iceberg spec
- Short-string invariance: when source < width, truncate returns full
  string, so equality implies exact match and STARTS_WITH remains valid

Added Transform::param() public API for accessing transform parameters,
replacing brittle ToString() regex parsing.

Comprehensive tests including UTF-8 edge cases and empty strings.
@wgtmac
Copy link
Member

wgtmac commented Nov 4, 2025

There are some merge conflicts...

@shangxinli
Copy link
Contributor Author

Fixed

/// This matches the behavior of TruncateUtils::TruncateUTF8.
///
/// NOTE: This counts code points, not grapheme clusters (user-perceived characters).
/// Per the Iceberg spec, combining marks count as separate code points.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Iceberg spec does not have this. We should not mislead readers.

///
/// \param str The UTF-8 encoded string
/// \return The number of code points (not bytes, not graphemes)
inline int32_t CountUTF8CodePoints(std::string_view str) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this should be moved to string_util.h or truncate_util.h

Comment on lines +285 to +290
auto* transform_term = dynamic_cast<BoundTransform*>(bound_term.get());
if (!transform_term) {
// Should never happen after kind check, but be defensive
return std::make_shared<BoundLiteralPredicate>(BASE::op(), std::move(bound_term),
std::move(literal));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
auto* transform_term = dynamic_cast<BoundTransform*>(bound_term.get());
if (!transform_term) {
// Should never happen after kind check, but be defensive
return std::make_shared<BoundLiteralPredicate>(BASE::op(), std::move(bound_term),
std::move(literal));
}
auto* transform_term = internal::checked_cast<BoundTransform*>(bound_term.get());

Let's reuse the existing pattern for cast.

!literal.IsNull()) { // Null safety: skip null literals

// Extract width parameter using type-safe API
auto width_opt = transform_term->transform()->param();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can greatly simplify this line and below.

We can add transform_func like below.

class ICEBERG_EXPORT BoundTransform : public BoundTerm {
 public:
  const std::shared_ptr<TransformFunction>& transform_func() const { return transform_func_; }
};

Then we just need to call transform_term->transform_func()->Transform(literal) and check if its return value is same as literal. We don't even need CountUTF8CodePoints above.

/// For non-parameterized transforms (identity, year, etc.), returns std::nullopt.
///
/// \return The parameter if present, otherwise std::nullopt
std::optional<int32_t> param() const {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove this function? This makes it harder to evolve the variant in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants