
implement liveness checks based on rdkafka health #41

Merged

xvello merged 3 commits into main from xvello/health on Nov 7, 2023
Conversation

@xvello (Contributor) commented Oct 30, 2023

While capture should be as resilient as possible to its dependencies being down (fail open if redis is down, keep trying to produce to kafka in case it comes back), we need to monitor the health of its internal loops and ensure the process is restarted if these loops stop reporting:

  • the HTTP server should be accepting new requests
  • the rdkafka producer thread must be running and reporting
  • we might add more asynchronous actors to the server soon, whose health we'll want to monitor

To address this, I propose a design where every component has to regularly report its health, and any component being unhealthy takes down the pod. Requiring frequent reporting before a given deadline enables us to catch cases where a loop is not executing anymore (either deadlocked or crashed).
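To make the deadline idea concrete, here is a minimal sketch of such a registry. The names (HealthRegistry, HealthHandle, report_healthy) and the lease duration are illustrative, not the exact API introduced by this PR:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::time::{Duration, Instant};

/// Deadline-based liveness registry: every component must report again
/// before its deadline expires, otherwise the whole process is unhealthy.
#[derive(Clone, Default)]
struct HealthRegistry {
    deadlines: Arc<Mutex<HashMap<String, Instant>>>,
}

impl HealthRegistry {
    /// Register a component; it starts with an already-expired deadline,
    /// so it counts as unhealthy until it reports for the first time.
    fn register(&self, name: &str) -> HealthHandle {
        self.deadlines
            .lock()
            .unwrap()
            .insert(name.to_string(), Instant::now());
        HealthHandle {
            name: name.to_string(),
            registry: self.clone(),
        }
    }

    /// Healthy only if every registered component reported recently enough.
    fn healthy(&self) -> bool {
        let now = Instant::now();
        self.deadlines.lock().unwrap().values().all(|d| *d > now)
    }
}

/// Handle given to each component's loop so it can report its own health.
struct HealthHandle {
    name: String,
    registry: HealthRegistry,
}

impl HealthHandle {
    /// Grant health for `lease` (e.g. 2-3x the expected reporting interval);
    /// if the loop stops calling this, the lease expires and liveness fails.
    fn report_healthy(&self, lease: Duration) {
        self.registry
            .deadlines
            .lock()
            .unwrap()
            .insert(self.name.clone(), Instant::now() + lease);
    }
}

fn main() {
    let registry = HealthRegistry::default();
    let kafka = registry.register("rdkafka");
    assert!(!registry.healthy()); // registered but not reporting yet

    kafka.report_healthy(Duration::from_secs(30));
    assert!(registry.healthy()); // reported within its lease
}
```

A crashed or deadlocked loop simply stops calling report_healthy, so the failure is detected without any explicit error path.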

I have seen systems try to handle liveness and readiness together and fail to do either properly. My recommendation is to handle these two conditions orthogonally, with two separate HealthRegistry instances:

  • for now, no registry for readiness, as the only component that should fail readiness is the HTTP server itself; returning 200 from a no-op handler is enough
  • create a HealthRegistry for liveness so the Kafka sink can report to it, piggy-backing on the statistics reporting callback that rdkafka invokes every 10 seconds (see the sketch below)
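To illustrate the second point, here is a hedged sketch of wiring the rdkafka statistics callback into the registry. It reuses the HealthHandle type from the sketch above; KafkaContext, build_producer and the 30-second lease are assumed names and values, not the code merged in this PR:

```rust
use std::time::Duration;

use rdkafka::client::ClientContext;
use rdkafka::config::ClientConfig;
use rdkafka::producer::FutureProducer;
use rdkafka::statistics::Statistics;

/// Custom client context holding a liveness handle for the Kafka sink.
struct KafkaContext {
    liveness: HealthHandle,
}

impl ClientContext for KafkaContext {
    // librdkafka fires this callback every `statistics.interval.ms`.
    // If the client's internal loop stalls, the callback stops firing,
    // the lease expires, and the liveness probe starts returning 500.
    fn stats(&self, _statistics: Statistics) {
        self.liveness.report_healthy(Duration::from_secs(30));
    }
}

fn build_producer(liveness: HealthHandle) -> FutureProducer<KafkaContext> {
    ClientConfig::new()
        .set("bootstrap.servers", "localhost:9092")
        .set("statistics.interval.ms", "10000") // report every 10 seconds
        .create_with_context(KafkaContext { liveness })
        .expect("failed to create Kafka producer")
}
```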

Logging output

[screenshot: logging output]

Kafka sink

Process starts unhealthy:

< HTTP/1.1 500 Internal Server Error
< content-type: text/plain; charset=utf-8
< content-length: 66
< date: Mon, 30 Oct 2023 14:46:15 GMT
<
HealthStatus { healthy: false, components: {"rdkafka": Starting} }%

Once the rdkafka client loop is started and reporting metrics, it's healthy:

< HTTP/1.1 200 OK
< content-type: text/plain; charset=utf-8
< content-length: 107
< date: Mon, 30 Oct 2023 14:46:19 GMT
<
HealthStatus { healthy: true, components: {"rdkafka": HealthyUntil(2023-10-30 14:46:48.212735 +00:00:00)} }%
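For context, a liveness endpoint along these lines could look like the following axum sketch, again building on the HealthRegistry sketch above. The route path and response body are assumptions, not the PR's exact handler:

```rust
use axum::{extract::State, http::StatusCode, routing::get, Router};

// Return 200 while every component is within its lease, 500 otherwise.
async fn liveness(State(registry): State<HealthRegistry>) -> (StatusCode, String) {
    if registry.healthy() {
        (StatusCode::OK, "healthy".to_string())
    } else {
        (StatusCode::INTERNAL_SERVER_ERROR, "unhealthy".to_string())
    }
}

fn health_router(registry: HealthRegistry) -> Router {
    // Kubernetes points its livenessProbe at this route; 500 responses past
    // the probe's failure threshold cause the pod to be restarted.
    Router::new()
        .route("/_liveness", get(liveness))
        .with_state(registry)
}
```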

Always fail with print sink

Neither k8s nor the hobby deploy should have the print sink enabled, so we make sure the container fails its liveness check:

< HTTP/1.1 500 Internal Server Error
< content-type: text/plain; charset=utf-8
< content-length: 70
< date: Mon, 30 Oct 2023 14:45:26 GMT

HealthStatus { healthy: false, components: {"print_sink": Unhealthy} }%
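In terms of the registry sketch above, this case is simply a component that registers but never reports (a hypothetical illustration, not the PR's print sink code):

```rust
fn main() {
    // The print sink registers a liveness component but never reports,
    // so the registry stays unhealthy and the probe keeps failing.
    let registry = HealthRegistry::default();
    let _print_sink = registry.register("print_sink");
    assert!(!registry.healthy()); // never becomes healthy -> pod restarts
}
```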

@xvello requested a review from ellie on October 30, 2023 15:47
@xvello requested a review from ellie on November 6, 2023 17:51
@xvello (Contributor, Author) commented Nov 6, 2023

@ellie rebased on the recent changes + added logging:

[screenshot: logging output]

@xvello merged commit 780f390 into main on Nov 7, 2023
@xvello deleted the xvello/health branch on November 7, 2023 12:53