Replies: 3 comments 12 replies
-
I am not sure if I have understood Improvement 1 correctly. |
Beta Was this translation helpful? Give feedback.
-
I have another question, from your example
If A is bigger than C, the watermark would be C at snapshot S2 too, when is the Watermark going to increase? |
Beta Was this translation helpful? Give feedback.
-
@majin1102 Thanks for kicking off this improvement. I have some confusion about this discussion.
|
Beta Was this translation helpful? Give feedback.
-
Let's talk about Mixed format Watermark implementation currently:
Whatever Flink or spark, a writer writing data whose event time between [A, B] will update Watermark to B - maxAllowLatency if new value > current Watermark. This implementation has two main issues:
It seems that Watermark in Mixed format currently can not guarantee time arrival using Spark or Flink. How can users use it to trigger downstream jobs for data arrival before a fix time?
I think there are three improvements which could help Watermark effectively used in production scenarios.
Make Watermark a more accurate semantics: data arrival guarantee for a fix time window! if Flink commits a snapshot S1 with event time range [A, B], then Watermark should be A for time window [S1, S1], and then Flink commits a snapshot S2 with event range[C, D], Watermark( no mater for S1 or S2) would be minimal(A, C) for time window [S1,S2] . Watermark would be a dynamic corrected value for data arrival guarantee, when users trust a watermark, he always trusts it under a certain time window, for example, I showed transaction list at 2:00AM to find earliest snapshot that guarantee watermark passed 24:00 in last day, finally I found Sn that matched my demand and I use Sn under this fixed time window of (Sn, Sk of 2:00AM).
We throw maxAllowLantency which has no meanings for table watermark. compute Watermark as Smallest event time of event time for snapshots and every watermark of certain snapshot could be reduced as new commits happening. Instead of maxAllowLantency we could define a property of WatermarkFixWindow to force a time window before a watermark could be sure.
Spark or batch writing does not update Watermark. Watermark is a metrics only for streaming operation.
Beta Was this translation helpful? Give feedback.
All reactions