Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AJ-1550: logging and health check for HybridConnectionListener #56

Merged
merged 4 commits into from
Jan 22, 2024

Conversation

davidangb
Copy link
Contributor

@davidangb davidangb commented Jan 19, 2024

Jira Ticket: https://broadworkbench.atlassian.net/browse/AJ-1550

Summary of changes:

Adds a new hybridConnectionListenerHealth health check, via Actuator, to the k8s liveness probe. This health check echoes the state of HybridConnectionListener.isOnline(), which is part of the azure-relay library.

Why

It seems that if the listener runs into certain errors - specifically, we've seen problems with java.util.concurrent.TimeoutException: DNS timeout 15000 ms - the listener's connections hang and stop responding.

This PR adds the listener's online state to the actuator health check. In theory, this would allow k8s to kill the listener pod and restart a new pod. This is drastic, but it could kick the listener connection back to a good place.

Testing strategy

  • Added a unit test that verifies HybridConnectionListener.isOnline() affects the liveness probe
  • Needs manual testing, in a real AKS environment, to see if HybridConnectionListener.isOnline() is reliable and goes down in the use cases we are seeing. See this comment

hybridConnectionListener.setOfflineHandler(
t -> logger.warn("HybridConnectionListener is now offline"));
hybridConnectionListener.setOnlineHandler(
() -> logger.warn("HybridConnectionListener is now online"));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe also have some custom counter metric emitted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this idea! It definitely blows up scope of this PR, though, since prometheus and metrics are not currently enabled in this codebase. If we wanted to enable prometheus/metrics, I'd suggest blowing up scope even further and upgrade to Spring Boot 3 first, since a bunch of metrics stuff changed between 2 and 3.

Copy link
Contributor

@rtitle rtitle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks convincing to me. Is there a good way to test it? Maybe we could simulate the case by deleting the hybrid connection for a valid listener, and verify the health check fails and k8s restarts the pod..

@@ -98,5 +98,8 @@ management:
health:
enabled: true
probes.enabled: true
group:
liveness:
include: livenessState,hybridConnectionListenerHealth
Copy link
Contributor Author

@davidangb davidangb Jan 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the liveness probe currently returns the abbreviated {"status": "UP"} or {"status": "DOWN"}. We could add show-details: ALWAYS here so it shows the individual statuses of each of the livenessState and hybridConnectionListenerHealth subsystems.

@davidangb davidangb marked this pull request as ready for review January 22, 2024 17:21
@davidangb davidangb changed the title DRAFT, NO TICKET YET: logging and health check for HybridConnectionListener AJ-1550: logging and health check for HybridConnectionListener Jan 22, 2024
Copy link

@davidangb davidangb merged commit a90bb04 into main Jan 22, 2024
4 checks passed
@davidangb davidangb deleted the da_HybridConnectionListenerHealth branch January 22, 2024 20:07
"Liveness state should be CORRECT",
applicationAvailability.getLivenessState(),
equalTo(LivenessState.CORRECT));
ResultActions result = mvc.perform(get("/actuator/health/liveness"));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is result leftover/redundant code? It looks identical to livenessResult but with a less specific name.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants