Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: exceptions in KafkaSqlStore do not trigger liveness checking #6021 #6037

Open
wants to merge 2 commits into
base: 2.6.x
Choose a base branch
from

Conversation

jsenko
Copy link
Member

@jsenko jsenko commented Mar 12, 2025

We have come across a case where after Registry (KafkaSQL) pod has started, h2 SQL exceptions were logged (probably because of topic corruption). The liveness check is expected to prevent the pod from running, but in this case the check was not triggered.

  1. The initial hypothesis was that the KafkaSqlStore h2 DB was missing a liveness interceptor. However, I found out that the exception was intentionally ignored. Since this might hide problems, I've created a new exception type and changed the code so it is thrown.

  2. However, that change does not address the liveness part. Currently, all exceptions occurring when reading the topic are handled in KafkaSqlSink. There are two cases:

    1. The exception is caused by a message our node sent and should be handled by the HTTP thread (that code path includes liveness interceptor on KafkaSqlRegistryStorage)
    2. The exception is caused by a message from another node and is currently ignored. Instead of adding the interceptor to KafkaSqlSink, I've decided to add code here that triggers the liveness check under specific circumstances (see comment).
  3. I also found out that despite liveness should be triggered in case 2 ii., that will happen only if the HTTP thread waits for the response, which was not done for import operations. I've fixed this as well.

  4. However, because there is no guarantee that an HTTP thread will wait on the response, we actually have a slow memory leak in KafkaSqlCoordinator.

  5. To address 3 and 4, I've added a code that would check for cases where the HTTP thread did not handle a return value or an exception, clean up, and determine if liveness check should be triggered.

I'll create a separate issue to investigate this for 3.x.

@jsenko
Copy link
Member Author

jsenko commented Mar 12, 2025

The test is failing because of an exception that was previously ignored:

Error:  Errors: 
Error:    MigrationTest.migrateData:36 » RestClient Import of entity 'io.apicurio.registry.utils.impexp.GroupEntity@4515c79' has failed: io.apicurio.registry.storage.impl.sql.jdb.RuntimeSqlException: org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint "groups_pkey"
  Detail: Key (tenantid, groupid)=(_, avro-schemas) already exists.
[INFO] 
Error:  Tests run: 342, Failures: 0, Errors: 1, Skipped: 2

How do we want to handle this? We cannot ignore import errors and fail liveness check at the same time. This might indicate some bug in the data importer?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant