As we know, Hudi proposed and introduced the Bucket Index in RFC-29. The Bucket Index nicely unifies the indexes of Flink and Spark: both engines can upsert the same Hudi table using the bucket index.
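The core idea behind the Bucket Index can be sketched as hashing the record key modulo a fixed bucket count. This is only an illustration of the principle, assuming a simple stable hash; Hudi's actual hash function differs:

```python
# Minimal sketch of fixed-bucket hashing, assuming the record key is hashed
# and reduced modulo a fixed bucket count. Hudi's real implementation uses
# its own hash function; this only illustrates why the mapping is stable.
import hashlib

def bucket_id(record_key: str, num_buckets: int) -> int:
    # Stable hash of the record key, reduced modulo the bucket count.
    digest = hashlib.md5(record_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

# Any writer (Flink or Spark) computes the same bucket for the same key,
# which is what lets both engines upsert the same table.
```

Because the mapping depends only on the key and the bucket count, it is deterministic across engines, but it also explains the limitation discussed next: changing `num_buckets` reshuffles every key.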
However, the Bucket Index is limited to a fixed number of buckets. To solve this problem, RFC-42 proposed consistent hashing, which achieves bucket resizing by dynamically splitting or merging local buckets.
But production (PRD) experience shows that a Partition-Level Bucket Index plus an offline way to rescale buckets is often good enough, without the additional machinery (multiple writes, clustering, automatic resizing, etc.). The more complex the architecture, the more error-prone it is and the greater the operation and maintenance pressure.
To that end, we could upgrade the traditional Bucket Index to a Partition-Level Bucket Index, so that users can set a specific number of buckets for different partitions through a rule engine (such as regular-expression matching). For existing partitions, an offline command is provided to reorganize the data using insert overwrite (data writing to the current partition must be stopped first).
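The rule engine mentioned above could work roughly as follows. This is a hypothetical sketch, not Hudi's API: the rule format (an ordered list of regex/bucket-count pairs with a table-level default, first match wins) is an assumption for illustration:

```python
# Hypothetical sketch of the proposed rule engine: an ordered list of
# (regex, bucket_count) rules, first match wins, falling back to a
# table-level default. Rule syntax and names here are assumptions.
import re

def resolve_bucket_number(partition_path: str,
                          rules: list,
                          default_buckets: int) -> int:
    for pattern, buckets in rules:
        if re.fullmatch(pattern, partition_path):
            return buckets
    return default_buckets

rules = [
    (r"dt=2023-12-.*", 64),  # hot, recent partitions get more buckets
    (r"dt=2023-.*", 16),     # older 2023 partitions get fewer
]
```

For example, `resolve_bucket_number("dt=2023-12-01", rules, 8)` would return 64, while a partition matching no rule falls back to the default of 8. Ordering the rules from most to least specific keeps the behavior predictable.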
More importantly, an existing Bucket Index table can be upgraded to the Partition-Level Bucket Index smoothly and seamlessly.
Any thoughts on this change? Feedback would be greatly appreciated!
Thanks for your attention. Yes, we will provide a new mechanism to specify the number of buckets for different partitions through an expression. For an existing partition, the bucket number can be changed through an offline job (Insert Overwrite) with the new bucket number. For new partitions, the initial bucket number is derived from the expression. Updates to the expression are also supported, but an updated expression only takes effect for new partitions.
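The update semantics described in the reply can be sketched as follows. The class and method names are hypothetical, chosen only to illustrate that a partition's bucket count is frozen when the partition is first written, so a later expression update affects only partitions created afterwards:

```python
# Sketch of the described semantics (names are hypothetical): once a
# partition is created, its bucket count is recorded and never changes
# implicitly; updating the expression only influences partitions that
# are created after the update.

class PartitionBucketRegistry:
    def __init__(self, expression):
        self.expression = expression  # callable: partition path -> bucket count
        self.committed = {}           # partition path -> frozen bucket count

    def buckets_for(self, partition: str) -> int:
        if partition not in self.committed:
            # New partition: evaluate the current expression and freeze it.
            self.committed[partition] = self.expression(partition)
        return self.committed[partition]

    def update_expression(self, new_expression):
        # Only affects partitions seen for the first time after this call.
        self.expression = new_expression

reg = PartitionBucketRegistry(lambda p: 16)
reg.buckets_for("dt=2024-01-01")    # frozen at 16
reg.update_expression(lambda p: 64)
# the existing partition keeps 16; a newly created partition gets 64
```

Changing the frozen count for an existing partition would correspond to the offline Insert Overwrite job mentioned above, which rewrites the partition under the new bucket number.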