
[FLINK-35749] Kafka sink component will lose data when kafka cluster is unavailable for a while #107

Merged
merged 10 commits into from
Jul 11, 2024

Conversation

@JimmyZZZ (Contributor) commented Jul 5, 2024

What is the purpose of the change

Fix a bug where data is lost while the Kafka cluster is unavailable for a while.
The related issue, https://issues.apache.org/jira/browse/FLINK-35749, describes how to reproduce the bug.

Brief change log

The key problem is in WriterCallback#onCompletion of KafkaWriter:

    mailboxExecutor.submit(
            () -> {
                // Checking for exceptions from previous writes
                checkAsyncException();
            },
            "Update error metric");

Calling 'mailboxExecutor.submit' without inspecting the returned future means the exception is never rethrown: 'asyncProducerException' is reset to null, yet the job continues as if nothing happened. The fix is to use 'mailboxExecutor.execute' instead of 'mailboxExecutor.submit'.
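The difference in exception visibility can be illustrated with plain java.util.concurrent primitives, a simplified stand-in for Flink's MailboxExecutor (class and method names here are illustrative, not from the PR):

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;

public class SubmitVsExecute {

    static final Runnable FAILING = () -> {
        throw new RuntimeException("Kafka send failed");
    };

    // submit()-style: the runnable is wrapped in a FutureTask, so the
    // exception is captured inside the Future instead of reaching the caller.
    static String runWrapped() {
        FutureTask<Void> wrapped = new FutureTask<>(FAILING, null);
        wrapped.run(); // completes "normally" from the caller's point of view
        try {
            wrapped.get();
            return null;
        } catch (ExecutionException e) {
            return e.getCause().getMessage(); // only visible via the Future
        } catch (InterruptedException e) {
            return null;
        }
    }

    // execute()-style: the runnable runs unwrapped, so the exception
    // propagates and can fail the task.
    static boolean runUnwrapped() {
        try {
            FAILING.run();
            return false;
        } catch (RuntimeException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("submit-style sees: " + runWrapped());
        System.out.println("execute-style propagated: " + runUnwrapped());
    }
}
```

If nobody ever calls get() on the future, the failure is silently dropped, which is exactly the data-loss scenario described above.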

Verifying this change

Added a new test case KafkaWriterFaultToleranceITCase to cover this bug.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

boring-cyborg bot commented Jul 5, 2024

Thanks for opening this pull request! Please check out our contributing guidelines. (https://flink.apache.org/contributing/how-to-contribute.html)

@AHeise (Contributor) left a comment


Thank you very much for your contribution. The actual fix looks good but you have some additional changes that are not necessary imho.

We also need to add a test case for the fix. In your description, you said it's easy to simulate when you set retries to a low value (don't use 0) and then Kafka becomes unavailable. So please add a test to KafkaWriterITCase where you call KAFKA_CONTAINER.stop() and check the result. Don't forget to start it again in a finally block, though. It might even be safer to create a new ITCase like KafkaWriterFaultToleranceITCase where you have more control over the container.
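The stop-in-try, restart-in-finally shape being suggested can be sketched like this. FakeContainer is a stand-in for Testcontainers' KafkaContainer (the real test would exercise the writer and assert that the async exception surfaces); all names here are hypothetical:

```java
public class ContainerStopPattern {

    // Stand-in for Testcontainers' KafkaContainer.
    static class FakeContainer {
        private boolean running = true;
        void stop()  { running = false; }
        void start() { running = true; }
        boolean isRunning() { return running; }
    }

    static boolean runScenario(FakeContainer container) {
        boolean sawExpectedFailure = false;
        try {
            container.stop();                 // simulate the broker outage
            // writer.write(...)/flush(...) would now time out; the real test
            // asserts that checkAsyncException() rethrows the failure.
            sawExpectedFailure = !container.isRunning();
        } finally {
            container.start();                // always restart for later tests
        }
        return sawExpectedFailure;
    }

    public static void main(String[] args) {
        FakeContainer kafka = new FakeContainer();
        System.out.println("failure observed: " + runScenario(kafka));
        System.out.println("container restarted: " + kafka.isRunning());
    }
}
```

The finally block matters because a stopped shared container would poison every test that runs afterwards.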

// fail over, which has been fixed in [FLINK-31305]. And using
// mailboxExecutor.execute() is better than
// mailboxExecutor.submit() since submit will swallow exceptions inside.
mailboxExecutor.execute(
@AHeise (Contributor)

Yes, this is the actual fix. Using execute instead of submit. This bug got introduced through FLINK-31305.

Note: submit does not swallow exceptions. It's an anti-pattern to use submit without looking at the Future. The future holds the result/exception to be handled async somewhere. If we bubble up the exception the task dies without the call-site being able to properly react to it.
So please remove the second part of your comment from "And using ...".

I will amend the javadoc of the mailbox executor to make it clear that the executor indeed does not bubble up exception in submit.

@JimmyZZZ (Contributor Author)

I agree, my description was not exact: if we use submit and get the returned future, the exception will also be thrown.

I'll remove this part: "And using ...".

@JimmyZZZ (Contributor Author)

@AHeise Removed now.

Contributor

Good catch, thanks for fixing this!

Contributor

As a follow-up, I opened and resolved https://issues.apache.org/jira/browse/FLINK-35796.

Comment on lines 416 to 429
// In the old version, asyncProducerException was set to null here, which caused another
// bug [FLINK-35749]: once asyncProducerException is set to null here, the MailboxExecutor
// thread in WriterCallback.onCompletion may also invoke checkAsyncException and get a
// chance to set asyncProducerException to null before flush() and write() see the
// exception. After that, flush() and write() can continue writing the next message batch
// to Kafka, and the old in-flight batch with exceptions is lost.
// We want KafkaWriter to fail fast once an exception is thrown while sending to Kafka
// and trigger a global restart, after which asyncProducerException is initialized as
// null again.
@AHeise (Contributor)

I don't think this change is necessary at all. You can see that we now have inflated error-count metrics in all the tests that you adjusted (which should be reverted as well).

The issue is not that we reset the exception. It's that it doesn't bubble up (see below). We actually have a clear contract on the volatile asyncProducerException: it's set in the onCompletion and read+reset in the main thread (which is the mailbox thread). So we never have a chance to concurrently reset the exception and lose it somehow.
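That single-writer/single-reader contract on the volatile field can be sketched as follows (a minimal illustration, with assumed names, not the actual KafkaWriter code):

```java
public class AsyncErrorHolder {

    // Set by the producer callback thread, read and reset only by the
    // mailbox (task) thread, so a reset can never race with a set-then-lose.
    private volatile Exception asyncProducerException;

    // Called from the Kafka producer's I/O thread (onCompletion).
    public void recordAsyncException(Exception e) {
        if (e != null && asyncProducerException == null) {
            asyncProducerException = e; // keep only the first exception
        }
    }

    // Called only from the mailbox thread (write/flush/close paths).
    public void checkAsyncException() throws Exception {
        Exception e = asyncProducerException;
        if (e != null) {
            asyncProducerException = null; // safe: single reader/resetter
            throw e;
        }
    }

    public static void main(String[] args) {
        AsyncErrorHolder holder = new AsyncErrorHolder();
        holder.recordAsyncException(new RuntimeException("send failed"));
        try {
            holder.checkAsyncException();
            System.out.println("no exception");
        } catch (Exception e) {
            System.out.println("rethrown: " + e.getMessage());
        }
    }
}
```

Because only the mailbox thread both reads and resets the field, the original bug was not a concurrent reset but the swallowed rethrow in submit, as noted above.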

@JimmyZZZ (Contributor Author)

Thanks for the reminder, I need some time to think this part over.

@JimmyZZZ (Contributor Author)

@AHeise I've removed the earlier modification to the reset logic in checkAsyncException(), and I understand what you mean about "we never have a chance to concurrently reset the exception" and agree with you.

Btw, I'm still a little curious: if we don't reset asyncProducerException to null in checkAsyncException(), what problems will we run into? The comment from [FLINK-31305] says to propagate the first exception since the number of exceptions could be large. Is this the only reason we throw the exception only once? That doesn't seem very persuasive to me.

@JimmyZZZ (Contributor Author)

If we keep asyncProducerException until the KafkaWriter task is restarted and it is reinitialized to null, that seems safer in case someone modifies the related logic later.

@AHeise (Contributor) commented Jul 9, 2024

It's a good point to raise if we need to reset at all or not. I'm assuming @mas-chen did it to emulate the previous behavior.

From what I can see, the only real downside is that we rethrow the exception in close (which is always called also in case of errors). That could prevent some proper cleanup and will count exceptions twice in most cases, which will inflate the metrics. I think the latter point has quite a bit of side-effects:

I'm expecting most customers to have some alerts on the exceptions metrics, which could mean that it's a behavior change that would alert more users than in previous versions. So in the worst case, some poor data engineer will be paged in the middle of the night because of that. And I would expect a few alerts to be changed to be less noisy.

All in all, I think that resetting the exception would be good IF we are all sure that there won't be any exception missed. Being sure that we can't have corrupted state is of course more important than having accurate metrics but ideally we have both.

@JimmyZZZ (Contributor Author)

Very good point about a "data engineer being paged in the middle of the night" once the alerts on the exception metrics change. I think we need to be very cautious about future modifications to the exception logic, and the new test KafkaWriterFaultToleranceITCase can help address this concern.

@AHeise (Contributor) commented Jul 5, 2024

Please also add some more details to the PR description and commit message in accordance with the guidelines. In particular, please use the proper GitHub PR template.

@JimmyZZZ (Contributor Author) commented Jul 8, 2024

@AHeise Added the new test case KafkaWriterFaultToleranceITCase, did some refactoring to extract shared pieces for KafkaWriterFaultToleranceITCase and KafkaWriterITCase, and provided more details in this PR following the GitHub PR template.

Please help review again, thanks very much.

@JimmyZZZ JimmyZZZ requested a review from AHeise July 8, 2024 17:32
@AHeise (Contributor) left a comment


LGTM. Production code is good. Test code is exhaustive and still minimal. I just added a remark on the producer settings in the test. We may be able to speed up the tests significantly. PTAL.

Do you need a committer that pushes the code for you? LMK if that's the case. I still need to get back into the community to understand who is committer and who is contributor ;).

Comment on lines 38 to 42
public class KafkaWriterFaultToleranceITCase extends KafkaWriterTestBase {
private static final String INIT_KAFKA_RETRIES = "1";
private static final String INIT_KAFKA_REQUEST_TIMEOUT_MS = "2000";
private static final String INIT_KAFKA_MAX_BLOCK_MS = "3000";
private static final String INIT_KAFKA_DELIVERY_TIMEOUT_MS = "4000";
@AHeise (Contributor)

Tests look good. I'm wondering if we can/need to tweak these numbers to make the test hopefully not flaky.

We have to make good tradeoffs here:

  • The overall test shouldn't take too long and we are relying on timeouts here.
  • Timeouts shouldn't be too small to compensate for infra hiccups.
  • Retries should be non-zero.

On first glance, the values all look good in that regard.

However, now at a second thought: do we ever want to have a successful request in this test? Maybe we could set really low timeouts and retry = 0 and provoke the failures even quicker?

@JimmyZZZ (Contributor Author)

Yes, the purpose is to trigger the timeout exception; retries=0 or 1 makes no difference here. I have changed it to retries=0 now.

@JimmyZZZ (Contributor Author)

Since we want to mock the scenario of Kafka being unavailable, I'm worried that a bad network on the test machine could also cause timeouts, so I keep the other timeout thresholds at 1 second or more here.
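The settings under discussion map to standard Kafka producer config keys. A sketch of what the final test configuration might look like (values mirror the constants quoted above, with retries lowered to 0 per the review; the class name is illustrative):

```java
import java.util.Properties;

public class FaultToleranceProducerProps {

    static Properties buildProps() {
        Properties props = new Properties();
        props.setProperty("retries", "0");                // fail fast, no retry
        props.setProperty("request.timeout.ms", "2000");  // per-request timeout
        props.setProperty("max.block.ms", "3000");        // bound send()/metadata blocking
        props.setProperty("delivery.timeout.ms", "4000"); // overall bound incl. retries
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildProps();
        props.forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Note that Kafka requires delivery.timeout.ms >= request.timeout.ms + linger.ms, which these values satisfy, so the producer accepts the configuration while still failing within a few seconds once the broker is down.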

import static org.apache.flink.connector.kafka.testutils.KafkaUtil.createKafkaContainer;

/** Test base for KafkaWriter. */
public abstract class KafkaWriterTestBase {
Contributor

Thank you very much for avoiding duplicate code.

@AHeise AHeise requested a review from mas-chen July 9, 2024 07:58
@JimmyZZZ (Contributor Author) commented Jul 9, 2024

@AHeise Yes, I need your help to reach someone who can push the code, since I'm also not sure who can do it. Thanks a lot.

@AHeise (Contributor) commented Jul 10, 2024

There is an ArchUnit rule violation:

2024-07-09T13:48:15.4851704Z [ERROR]   Architecture Violation [Priority: MEDIUM] - Rule 'ITCASE tests should use a MiniCluster resource or extension' was violated (1 times):
2024-07-09T13:48:15.4853792Z org.apache.flink.connector.kafka.sink.KafkaWriterFaultToleranceITCase does not satisfy: only one of the following predicates match:
2024-07-09T13:48:15.4856292Z * reside in a package 'org.apache.flink.runtime.*' and contain any fields that are static, final, and of type InternalMiniClusterExtension and annotated with @RegisterExtension
2024-07-09T13:48:15.4859585Z * reside outside of package 'org.apache.flink.runtime.*' and contain any fields that are static, final, and of type MiniClusterExtension and annotated with @RegisterExtension or are , and of type MiniClusterTestEnvironment and annotated with @TestEnv
2024-07-09T13:48:15.4862798Z * reside in a package 'org.apache.flink.runtime.*' and is annotated with @ExtendWith with class InternalMiniClusterExtension
2024-07-09T13:48:15.4864813Z * reside outside of package 'org.apache.flink.runtime.*' and is annotated with @ExtendWith with class MiniClusterExtension
2024-07-09T13:48:15.4871237Z  or contain any fields that are public, static, and of type MiniClusterWithClientResource and final and annotated with @ClassRule or contain any fields that is of type MiniClusterWithClientResource and public and final and not static and annotated with @Rule

I dug into it:

The rule doesn't make any sense for ITCases and should be reserved for E2E tests. Please add it to the known archunit violations and I will clean up at a later point.

@JimmyZZZ (Contributor Author)

Added KafkaWriterFaultToleranceITCase to the known ArchUnit violations, please take a look. @AHeise

@AHeise (Contributor) commented Jul 10, 2024

At this point, we just try to get the build green. You can check locally by running mvn verify.

@JimmyZZZ JimmyZZZ closed this Jul 11, 2024
@JimmyZZZ JimmyZZZ reopened this Jul 11, 2024
@JimmyZZZ (Contributor Author)

Checked, it should be green now.

@AHeise (Contributor) commented Jul 11, 2024

Retriggering CI.

@JimmyZZZ (Contributor Author)

@AHeise It's strange that only this test failed: compile_and_test (1.17.2, 8, 11) / compile_and_test (8)

I dug into it and found that the two tests modified in this PR are green:
2024-07-11T06:44:35.4459342Z [INFO] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 37.521 s - in org.apache.flink.connector.kafka.sink.KafkaWriterFaultToleranceITCase
2024-07-11T06:43:56.5827523Z [INFO] Tests run: 19, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 14.883 s - in org.apache.flink.connector.kafka.sink.KafkaWriterITCase

and it seems something is wrong with the env or network, because I found this log in the failed test: 2024-07-11T07:19:03.3807103Z ##[error]The action 'Compile and test' has timed out after 50 minutes.

So could you help take a look or just trigger it again?

@AHeise (Contributor) commented Jul 11, 2024

The hanging test comes from the source DynamicKafkaSourceITTest.IntegrationTests.testIdleReader. I don't see any sink usage in the test, so the failure is unrelated.

I'm rerunning the job and will merge soonish.

@AHeise AHeise merged commit 4429b78 into apache:main Jul 11, 2024
13 checks passed
boring-cyborg bot commented Jul 11, 2024

Awesome work, congrats on your first merged pull request!
