AJ-1550: logging and health check for HybridConnectionListener #56
Conversation
```java
hybridConnectionListener.setOfflineHandler(
    t -> logger.warn("HybridConnectionListener is now offline"));
hybridConnectionListener.setOnlineHandler(
    () -> logger.warn("HybridConnectionListener is now online"));
```
Maybe also have some custom counter metric emitted?
I like this idea! It definitely blows up the scope of this PR, though, since Prometheus and metrics are not currently enabled in this codebase. If we wanted to enable Prometheus/metrics, I'd suggest expanding scope even further and upgrading to Spring Boot 3 first, since a bunch of the metrics machinery changed between 2 and 3.
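A minimal, dependency-free sketch of the counter idea, using a plain `AtomicLong` as a stand-in for a real Micrometer `Counter` (the class and method names here are hypothetical, not from this PR):

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: count offline transitions so they could later be
// exported as a metric (e.g. via Micrometer, once metrics are enabled).
public class ListenerOfflineCounter {
    private final AtomicLong offlineEvents = new AtomicLong();

    // Would be called from the listener's offline handler, alongside the log line.
    public void recordOffline() {
        offlineEvents.incrementAndGet();
    }

    public long offlineEventCount() {
        return offlineEvents.get();
    }

    public static void main(String[] args) {
        ListenerOfflineCounter counter = new ListenerOfflineCounter();
        counter.recordOffline();
        counter.recordOffline();
        System.out.println(counter.offlineEventCount()); // prints 2
    }
}
```

In the handler, this would sit next to the existing `logger.warn` call, so the log line and the metric stay in sync.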
This looks convincing to me. Is there a good way to test it? Maybe we could simulate the failure by deleting the hybrid connection for a valid listener, then verify that the health check fails and k8s restarts the pod.
```diff
@@ -98,5 +98,8 @@ management:
   health:
     enabled: true
     probes.enabled: true
+    group:
+      liveness:
+        include: livenessState,hybridConnectionListenerHealth
```
The liveness probe currently returns the abbreviated `{"status": "UP"}` or `{"status": "DOWN"}`. We could add `show-details: ALWAYS` here so it shows the individual statuses of each of the `livenessState` and `hybridConnectionListenerHealth` subsystems.
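A sketch of what that suggestion might look like, extending the group config from the diff above (hedged: the exact property nesting depends on the Spring Boot version; `show-details` can also be set at the endpoint level instead of per group):

```yaml
# Hypothetical extension of the liveness group config shown in the diff.
group:
  liveness:
    include: livenessState,hybridConnectionListenerHealth
    show-details: ALWAYS
```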
```java
    "Liveness state should be CORRECT",
    applicationAvailability.getLivenessState(),
    equalTo(LivenessState.CORRECT));
ResultActions result = mvc.perform(get("/actuator/health/liveness"));
```
Is `result` leftover/redundant code? It looks identical to `livenessResult` but with a less specific name.
Jira Ticket: https://broadworkbench.atlassian.net/browse/AJ-1550

Summary of changes:
Adds a new `hybridConnectionListenerHealth` health check, via Actuator, to the k8s liveness probe. This health check echoes the state of `HybridConnectionListener.isOnline()`, which is part of the `azure-relay` library.

Why
It seems that if the listener runs into certain errors - specifically, we've seen problems with `java.util.concurrent.TimeoutException: DNS timeout 15000 ms` - the listener's connections hang and stop responding. This PR adds the listener's online state to the actuator health check. In theory, this would allow k8s to kill the listener pod and restart a new pod. This is drastic, but it could kick the listener connection back to a good place.
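A minimal, dependency-free sketch of the health-check idea described above (hedged: the real change presumably implements Spring Actuator's `HealthIndicator` and returns `Health.up()`/`Health.down()`; the stub interface and class names here are illustrative only, so the example runs without Spring on the classpath):

```java
import java.util.function.BooleanSupplier;

// Illustrative stand-in for an actuator health contract.
interface SimpleHealthCheck {
    boolean isHealthy();
}

// Hypothetical indicator that echoes the listener's online state.
// In the real code the BooleanSupplier would be the azure-relay
// HybridConnectionListener's isOnline() method reference.
public class HybridConnectionListenerHealthSketch implements SimpleHealthCheck {
    private final BooleanSupplier isOnline;

    public HybridConnectionListenerHealthSketch(BooleanSupplier isOnline) {
        this.isOnline = isOnline;
    }

    @Override
    public boolean isHealthy() {
        return isOnline.getAsBoolean();
    }

    public static void main(String[] args) {
        // Simulate the listener going offline: the check reports DOWN,
        // which would fail the liveness probe and let k8s restart the pod.
        SimpleHealthCheck check = new HybridConnectionListenerHealthSketch(() -> false);
        System.out.println(check.isHealthy() ? "UP" : "DOWN"); // prints DOWN
    }
}
```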
Testing strategy
- Unit tests that `HybridConnectionListener.isOnline()` affects the liveness probe.
- Still to confirm: that `HybridConnectionListener.isOnline()` is reliable and goes down in the use cases we are seeing. See this comment.