
implement liveness checks based on rdkafka health #41

Merged

xvello merged 3 commits into main from xvello/health on Nov 7, 2023
Conversation

@xvello (Contributor) commented Oct 30, 2023

While capture should be as resilient as possible to its dependencies being down (fail open if redis is down, keep trying to produce to kafka in case it comes back), we need to monitor the health of its internal loops and ensure the process is restarted if these loops stop reporting:

  • the HTTP server should be accepting new requests
  • the rdkafka producer thread must be running and reporting
  • we might add more asynchronous actors to the server soon, whose health we'll want to monitor

To address this, I propose a design where every component has to regularly report its health, and any component being unhealthy takes down the pod. Requiring frequent reporting before a given deadline enables us to catch cases where a loop is not executing anymore (either deadlocked or crashed).
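To make the deadline idea concrete, here is a minimal sketch of such a registry. The names (HealthRegistry, HealthHandle, report_healthy) and the lease duration are illustrative, not the exact API introduced by this PR:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::time::{Duration, Instant};

/// Deadline-based liveness registry: every component must report again
/// before its deadline expires, otherwise the whole process is unhealthy.
#[derive(Clone, Default)]
struct HealthRegistry {
    deadlines: Arc<Mutex<HashMap<String, Instant>>>,
}

impl HealthRegistry {
    /// Register a component; it starts with an already-expired deadline,
    /// so it counts as unhealthy until it reports for the first time.
    fn register(&self, name: &str) -> HealthHandle {
        self.deadlines
            .lock()
            .unwrap()
            .insert(name.to_string(), Instant::now());
        HealthHandle {
            name: name.to_string(),
            registry: self.clone(),
        }
    }

    /// Healthy only if every registered component reported recently enough.
    fn healthy(&self) -> bool {
        let now = Instant::now();
        self.deadlines.lock().unwrap().values().all(|d| *d > now)
    }
}

/// Handle given to each component's loop so it can report its own health.
struct HealthHandle {
    name: String,
    registry: HealthRegistry,
}

impl HealthHandle {
    /// Grant health for `lease` (e.g. 2-3x the expected reporting interval);
    /// if the loop stops calling this, the lease expires and liveness fails.
    fn report_healthy(&self, lease: Duration) {
        self.registry
            .deadlines
            .lock()
            .unwrap()
            .insert(self.name.clone(), Instant::now() + lease);
    }
}

fn main() {
    let registry = HealthRegistry::default();
    let kafka = registry.register("rdkafka");
    assert!(!registry.healthy()); // registered but not reporting yet

    kafka.report_healthy(Duration::from_secs(30));
    assert!(registry.healthy()); // reported within its lease
}
```

A crashed or deadlocked loop simply stops calling report_healthy, so the failure is detected without any explicit error path.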

I have seen systems try to handle liveness and readiness together and fail to do either properly. My recommendation is to handle these two conditions orthogonally, with two separate HealthRegistry instances:

  • for now, no registry for readiness, as the only component that should fail readiness is the HTTP server itself; returning 200 from a no-op handler is enough
  • create a HealthRegistry for liveness so the Kafka sink can report to it, piggy-backing on the statistics reporting callback that rdkafka invokes every 10 seconds (see the sketch below)
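To illustrate the second point, here is a hedged sketch of wiring the rdkafka statistics callback into the registry. It reuses the HealthHandle type from the sketch above; KafkaContext, build_producer and the 30-second lease are assumed names and values, not the code merged in this PR:

```rust
use std::time::Duration;

use rdkafka::client::ClientContext;
use rdkafka::config::ClientConfig;
use rdkafka::producer::FutureProducer;
use rdkafka::statistics::Statistics;

/// Custom client context holding a liveness handle for the Kafka sink.
struct KafkaContext {
    liveness: HealthHandle,
}

impl ClientContext for KafkaContext {
    // librdkafka fires this callback every `statistics.interval.ms`.
    // If the client's internal loop stalls, the callback stops firing,
    // the lease expires, and the liveness probe starts returning 500.
    fn stats(&self, _statistics: Statistics) {
        self.liveness.report_healthy(Duration::from_secs(30));
    }
}

fn build_producer(liveness: HealthHandle) -> FutureProducer<KafkaContext> {
    ClientConfig::new()
        .set("bootstrap.servers", "localhost:9092")
        .set("statistics.interval.ms", "10000") // report every 10 seconds
        .create_with_context(KafkaContext { liveness })
        .expect("failed to create Kafka producer")
}
```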

Logging output

[screenshot: logging output]

Kafka sink

Process starts unhealthy:

< HTTP/1.1 500 Internal Server Error
< content-type: text/plain; charset=utf-8
< content-length: 66
< date: Mon, 30 Oct 2023 14:46:15 GMT
<
HealthStatus { healthy: false, components: {"rdkafka": Starting} }%

Once the rdkafka client loop is started and reporting metrics, it's healthy:

< HTTP/1.1 200 OK
< content-type: text/plain; charset=utf-8
< content-length: 107
< date: Mon, 30 Oct 2023 14:46:19 GMT
<
HealthStatus { healthy: true, components: {"rdkafka": HealthyUntil(2023-10-30 14:46:48.212735 +00:00:00)} }%
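For context, a liveness endpoint along these lines could look like the following axum sketch, again building on the HealthRegistry sketch above. The route path and response body are assumptions, not the PR's exact handler:

```rust
use axum::{extract::State, http::StatusCode, routing::get, Router};

// Return 200 while every component is within its lease, 500 otherwise.
async fn liveness(State(registry): State<HealthRegistry>) -> (StatusCode, String) {
    if registry.healthy() {
        (StatusCode::OK, "healthy".to_string())
    } else {
        (StatusCode::INTERNAL_SERVER_ERROR, "unhealthy".to_string())
    }
}

fn health_router(registry: HealthRegistry) -> Router {
    // Kubernetes points its livenessProbe at this route; 500 responses past
    // the probe's failure threshold cause the pod to be restarted.
    Router::new()
        .route("/_liveness", get(liveness))
        .with_state(registry)
}
```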

Always fail with print sink

Neither k8s nor the hobby deploy should have the print sink enabled, so we make sure the container fails its liveness check:

< HTTP/1.1 500 Internal Server Error
< content-type: text/plain; charset=utf-8
< content-length: 70
< date: Mon, 30 Oct 2023 14:45:26 GMT

HealthStatus { healthy: false, components: {"print_sink": Unhealthy} }%
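In terms of the registry sketch above, this case is simply a component that registers but never reports (a hypothetical illustration, not the PR's print sink code):

```rust
fn main() {
    // The print sink registers a liveness component but never reports,
    // so the registry stays unhealthy and the probe keeps failing.
    let registry = HealthRegistry::default();
    let _print_sink = registry.register("print_sink");
    assert!(!registry.healthy()); // never becomes healthy -> pod restarts
}
```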

@xvello requested a review from ellie on October 30, 2023 15:47
@xvello requested a review from ellie on November 6, 2023 17:51
@xvello (Contributor, Author) commented Nov 6, 2023

@ellie rebased on the recent changes + added logging:

[screenshot: logging output]

@xvello merged commit 780f390 into main on Nov 7, 2023
@xvello deleted the xvello/health branch on November 7, 2023 12:53