Skip to content

Fix service discovery bug in kubernetes-extensions#19139

Merged
gianm merged 16 commits intoapache:masterfrom
capistrant:k8s-discovery-bug
Mar 18, 2026
Merged

Fix service discovery bug in kubernetes-extensions#19139
gianm merged 16 commits intoapache:masterfrom
capistrant:k8s-discovery-bug

Conversation

@capistrant
Copy link
Contributor

@capistrant capistrant commented Mar 11, 2026

Description

Bug Report

The k8s service discovery is not removing discovered nodes whose pods still exist with service announcement labels, but the underlying services are actually unhealthy.

For example, if a broker container is killed but the pod that manages it remains in the namespace with announcement labels, all druid services will maintain this service in their discovered services cache. This leads to queries being routed to a broker that cannot possibly execute the request. If this pod remains in an announced but unhealthy state for any meaningful period of time, the cluster functionality can be severely compromised.

Desired behavior in the above example would be that the broker is removed from discovered services caches, at least until the underlying container for the pod is restarted and the pod is healthy again.

Fix Details

My proposed fix starts using a pods readiness flag in the discovery logic. If a pod is not ready, the underlying services will not be added to service discovery caches they are not in and will be removed from any caches that they were in. These services can be added back once they have a MODIFIED or ADDED event in addition to being ready again.

Fix Risks

The biggest risk I see is that this new reliance on readiness probe introduces an expectation that this probe is accurate and stable. I try to call out in documentation that this needs to be considered when defining the readiness probe for a pod as a way to mitigate unexpected changes for users. This could be included in a release note as well to tip off any users of the extension.

Alternatives

Using Pod Phase is another option for this. This feels like a more permissive check than readiness in that it may be more lax with removing a service from discovered hosts than readiness.

The idea would be to treat Running and Unknown pod phase states as discoverable. Unknown is there to avoid k8s control plane communication issues / transient problems from removing hosts from discovered lists. One downside here based on k8s documentation (linked above) is that a pod will be in running phase if a container is in the process of starting or restarting. This technically means that services who are not actually ready to be in the cluster could still be discovered if they have the labels and are in a transient state trying to start. The fear here being that a service gets into a restart loop without changing phase, so it remains discoverable but is not going to become healthy in the near future (or ever).

Release note

TBD


Key changed/added classes in this PR
  • DefaultK8sApiClient
  • BaseNodeRoleWatcher
  • WatchResult

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@capistrant
Copy link
Contributor Author

marking this as draft while I evaluate a competing approach that uses pod phase instead of readiness

@capistrant capistrant marked this pull request as draft March 11, 2026 22:32
@capistrant capistrant marked this pull request as ready for review March 17, 2026 22:31
@capistrant capistrant added this to the 37.0.0 milestone Mar 17, 2026
@capistrant
Copy link
Contributor Author

marking this as draft while I evaluate a competing approach that uses pod phase instead of readiness

I think that I prefer readiness to pod phase.

being ready means that you are ready to announce yourself to the cluster. If a readiness probe is properly configured using our recommendations, if the probe fails enough to cross threshold of being not ready, the service is no longer ready to be announced and should stop getting traffic.

also in protecting against the worst case of a pod having announce labels but the container having been killed without proper shutdown + not being rescheduled, readiness will properly handle undiscovering such a service nearly immediately upon crash.

effectiveType = WatchResult.NOT_READY;
} else if (WatchResult.ADDED.equals(item.type)) {
// Pod is not ready yet (e.g., still starting up). Skip this event entirely.
// It will appear via a MODIFIED event that remaps to ADDED for dicovery, once it becomes ready.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

discovery (spelling)

@@ -26,9 +26,18 @@
public interface WatchResult
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class used to be a relatively thin layer over k8s watches, now there's remapping and synthetic events going on. The javadoc of this class should make clear that it's not meant to be a thin layer over k8s watches, it's meant to be aligned with the needs of service discovery.

@@ -273,6 +273,11 @@ private void keepWatching(String labelSelector, String resourceVersion)
case WatchResult.DELETED:
baseNodeRoleWatcher.childRemoved(item.object.getNode());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can DELETED fire after NOT_READY (or even before ADDED, if a pod is deleted before it ever becomes ready)? I think the logging in those cases will end up noisy, since childRemoved complains loudly if the service doesn't exist. You added a skipIfUnknown that is used when NOT_READY fires, but I wonder if now we should always act like that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, these could certainly happen. thank you for calling it out

@gianm gianm merged commit 15e98ba into apache:master Mar 18, 2026
37 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants