Fix service discovery bug in kubernetes-extensions#19139
gianm merged 16 commits into apache:master
marking this as draft while I evaluate a competing approach that uses pod phase instead of readiness
I think I prefer readiness to pod phase. Being ready means you are ready to announce yourself to the cluster. If a readiness probe is configured following our recommendations, and the probe fails often enough to cross the not-ready threshold, the service should no longer be announced and should stop getting traffic. Readiness also protects against the worst case: a pod that keeps its announce labels after its container is killed without a proper shutdown and is not rescheduled. In that case, readiness will properly undiscover the service almost immediately after the crash.
        effectiveType = WatchResult.NOT_READY;
      } else if (WatchResult.ADDED.equals(item.type)) {
        // Pod is not ready yet (e.g., still starting up). Skip this event entirely.
        // It will appear via a MODIFIED event that remaps to ADDED for discovery, once it becomes ready.
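The remapping this hunk belongs to can be sketched as follows. This is a minimal illustration, not the PR's actual code: the `remap` helper, its signature, and the handling of each case are assumptions inferred from the hunk above and the review discussion.

```java
// Sketch of readiness-aware remapping of raw k8s watch events.
// Names and structure are illustrative; only the behavior described in the
// hunk above (skip ADDED for not-ready pods, remap MODIFIED) is from the PR.
final class WatchEventRemapper
{
  enum WatchResult { ADDED, MODIFIED, DELETED, NOT_READY }

  /**
   * Remap a raw watch event type based on pod readiness.
   * Returns null when the event should be skipped entirely.
   */
  static WatchResult remap(WatchResult rawType, boolean podReady)
  {
    if (!podReady) {
      if (WatchResult.ADDED == rawType) {
        // Pod is not ready yet (e.g., still starting up). Skip this event;
        // a later MODIFIED event is remapped to ADDED once the pod is ready.
        return null;
      }
      if (WatchResult.MODIFIED == rawType) {
        // Pod lost readiness: synthesize NOT_READY so it gets undiscovered.
        return WatchResult.NOT_READY;
      }
      return rawType; // DELETED passes through unchanged.
    }
    if (WatchResult.MODIFIED == rawType) {
      // Pod (re)gained readiness: surface as ADDED so it is (re)discovered.
      return WatchResult.ADDED;
    }
    return rawType;
  }
}
```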
@@ -26,9 +26,18 @@
public interface WatchResult
This class used to be a relatively thin layer over k8s watches, but now there's remapping and synthetic events going on. The javadoc of this class should make clear that it's no longer meant to be a thin layer over k8s watches; it's meant to be aligned with the needs of service discovery.
@@ -273,6 +273,11 @@ private void keepWatching(String labelSelector, String resourceVersion)
      case WatchResult.DELETED:
        baseNodeRoleWatcher.childRemoved(item.object.getNode());
Can DELETED fire after NOT_READY (or even before ADDED, if a pod is deleted before it ever becomes ready)? I think the logging in those cases will end up noisy, since childRemoved complains loudly if the service doesn't exist. You added a skipIfUnknown that is used when NOT_READY fires, but I wonder if now we should always act like that.
yes, these could certainly happen. thank you for calling it out
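The removal path discussed above (DELETED or NOT_READY arriving for a node that was never discovered) can be sketched like this. The real `BaseNodeRoleWatcher.childRemoved` differs; the cache, return value, and logging behavior here are illustrative assumptions, showing only the idea of always tolerating unknown nodes rather than complaining loudly.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-in for a node-role cache; not the PR's BaseNodeRoleWatcher.
final class NodeCache
{
  private final Map<String, String> nodes = new HashMap<>();

  void childAdded(String nodeId, String host)
  {
    nodes.put(nodeId, host);
  }

  /**
   * Remove a node. DELETED and NOT_READY can fire for nodes that were never
   * discovered (e.g., a pod deleted before it ever became ready), so removal
   * always tolerates unknown nodes instead of logging an error.
   */
  boolean childRemoved(String nodeId)
  {
    if (nodes.remove(nodeId) == null) {
      // Unknown node: log quietly (debug level) rather than complaining loudly.
      return false;
    }
    return true;
  }

  int size()
  {
    return nodes.size();
  }
}
```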
Description
Bug Report
The k8s service discovery does not remove discovered nodes whose pods still exist with service-announcement labels even though the underlying services are unhealthy.
For example, if a broker container is killed but the pod that manages it remains in the namespace with announcement labels, all druid services will maintain this service in their discovered services cache. This leads to queries being routed to a broker that cannot possibly execute the request. If this pod remains in an announced but unhealthy state for any meaningful period of time, the cluster functionality can be severely compromised.
Desired behavior in the above example would be that the broker is removed from discovered services caches, at least until the underlying container for the pod is restarted and the pod is healthy again.
Fix Details
My proposed fix uses a pod's readiness flag in the discovery logic. If a pod is not ready, its services will not be added to service discovery caches and will be removed from any caches they are already in. These services can be added back once a MODIFIED or ADDED event arrives and the pod is ready again.
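The readiness check behind this fix can be sketched as below. In the actual extension the pod status would come from the Kubernetes Java client's `V1Pod` model; the minimal `PodCondition` type here is a hypothetical stand-in, and only the rule (a pod is ready when its `Ready` condition has status `True`) reflects Kubernetes semantics.

```java
import java.util.List;

// Illustrative readiness check; PodCondition is a stand-in for the real
// Kubernetes client model (V1PodCondition). Only the Ready/"True" rule
// mirrors actual Kubernetes pod-status semantics.
final class PodReadiness
{
  record PodCondition(String type, String status) {}

  /**
   * A pod counts as ready only when its status carries a condition of type
   * "Ready" whose status is "True". Missing conditions mean not ready.
   */
  static boolean isPodReady(List<PodCondition> conditions)
  {
    if (conditions == null) {
      return false;
    }
    for (PodCondition c : conditions) {
      if ("Ready".equals(c.type())) {
        return "True".equals(c.status());
      }
    }
    return false;
  }
}
```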
Fix Risks
The biggest risk I see is that this new reliance on the readiness probe introduces an expectation that the probe is accurate and stable. I call this out in the documentation so it is considered when defining a pod's readiness probe, as a way to mitigate unexpected changes for users. It could be included in a release note as well to tip off users of the extension.
Alternatives
Using pod phase is another option. It feels like a more permissive check than readiness, in that it may be more lax about removing a service from discovered hosts.
The idea would be to treat the Running and Unknown pod phase states as discoverable. Unknown is included so that k8s control plane communication issues or other transient problems do not remove hosts from discovered lists. One downside, based on the k8s documentation (linked above), is that a pod is in the Running phase while a container is starting or restarting. This means services that are not actually ready to be in the cluster could still be discovered if they have the labels and are in a transient starting state. The fear is that a service gets into a restart loop without changing phase, so it remains discoverable but is not going to become healthy in the near future (or ever).

Release note
TBD
Key changed/added classes in this PR
DefaultK8sApiClient
BaseNodeRoleWatcher
WatchResult

This PR has: