Add health check endpoint #395

pondzix · 2025-01-29T07:07:31Z

PDP-1557

For now, the only contributor to 'health' is the source. In this draft, the Kinesis source does the following:

* We keep track of events currently being processed downstream. If any message in memory has not been acked for longer than a configured threshold, the app is unhealthy.
* We check that the underlying kinsumer client keeps attempting to fetch records from Kinesis. Having no records on input is fine; we only need to know that kinsumer is not stuck for some unknown reason. If kinsumer has made no fetch attempt for longer than a configured threshold, the app is unhealthy.

Health is exposed through the `/health` HTTP endpoint.
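To make the two conditions concrete, here is a minimal sketch of how they could be combined behind an HTTP handler. It is not the code in this PR: `healthTracker`, its methods, the 503 status for the unhealthy case, and the two threshold fields are hypothetical names and choices made only for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

// healthTracker is a hypothetical helper, not the type used in this PR.
type healthTracker struct {
	mu              sync.Mutex
	unacked         map[string]time.Time // message ID -> time it entered the app
	lastFetch       time.Time            // last fetch attempt seen from the client
	maxUnackedAge   time.Duration        // assumed config value
	maxFetchSilence time.Duration        // assumed config value
}

func newHealthTracker(maxUnackedAge, maxFetchSilence time.Duration) *healthTracker {
	return &healthTracker{
		unacked:         make(map[string]time.Time),
		lastFetch:       time.Now(),
		maxUnackedAge:   maxUnackedAge,
		maxFetchSilence: maxFetchSilence,
	}
}

// recordFetch notes that the client attempted a fetch, even an empty one.
func (h *healthTracker) recordFetch(now time.Time) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.lastFetch = now
}

// addUnacked marks a message as in flight.
func (h *healthTracker) addUnacked(id string, now time.Time) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.unacked[id] = now
}

// removeUnacked clears a message once it has been acked.
func (h *healthTracker) removeUnacked(id string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	delete(h.unacked, id)
}

// healthy applies both checks: recent fetch activity and no stale unacked message.
func (h *healthTracker) healthy(now time.Time) (bool, string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if now.Sub(h.lastFetch) > h.maxFetchSilence {
		return false, "no fetch attempt within threshold"
	}
	for id, since := range h.unacked {
		if now.Sub(since) > h.maxUnackedAge {
			return false, "message " + id + " unacked beyond threshold"
		}
	}
	return true, "ok"
}

// ServeHTTP answers 200 when healthy and 503 otherwise.
func (h *healthTracker) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	ok, reason := h.healthy(time.Now())
	if !ok {
		http.Error(w, reason, http.StatusServiceUnavailable)
		return
	}
	fmt.Fprintln(w, reason)
}

func main() {
	h := newHealthTracker(time.Minute, time.Minute)
	mux := http.NewServeMux()
	mux.Handle("/health", h)
	// http.ListenAndServe(":8080", mux) would serve the probe; omitted so the sketch exits.

	start := time.Now()
	h.recordFetch(start)
	h.addUnacked("msg-1", start)
	fmt.Println(h.healthy(start))
	fmt.Println(h.healthy(start.Add(2 * time.Minute)))
}
```

Passing `now` into the checks, rather than calling `time.Now()` inside them, is only a convenience so the thresholds can be exercised without sleeping.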

istreeter

In your implementation, two different things contribute to bad health:

1. Problems receiving events from the external source.
2. Events stuck in the app, i.e. a processing problem or a sink problem.

For 1, it completely makes sense to implement it in the source.

But for 2... isn't this a helpful health check for all sources? You could implement 2 exactly the same in the pubsub source and kafka source. But then you would be repeating the same code in all sources.

In other words, is there any part of this that can be moved out of the source?

I appreciate this is a difficult problem, because I have struggled with the same questions in common-streams.
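One possible shape for moving check 2 out of the sources, sketched under assumptions: if every source hands over messages that carry an ack callback, a shared tracker could wrap that callback once, so no source repeats the bookkeeping. The `message` struct, `ackTracker`, and its methods below are hypothetical stand-ins, not types from this repository.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// message is a hypothetical stand-in for whatever the sources emit.
type message struct {
	ID  string
	Ack func()
}

// ackTracker is a hypothetical source-independent record of in-flight messages.
type ackTracker struct {
	mu      sync.Mutex
	pending map[string]time.Time // message ID -> time tracking started
}

func newAckTracker() *ackTracker {
	return &ackTracker{pending: make(map[string]time.Time)}
}

// track records the message as in flight and returns a copy whose Ack
// runs the source's own ack first and then clears the record.
func (t *ackTracker) track(m message) message {
	t.mu.Lock()
	t.pending[m.ID] = time.Now()
	t.mu.Unlock()

	original := m.Ack
	m.Ack = func() {
		if original != nil {
			original()
		}
		t.mu.Lock()
		delete(t.pending, m.ID)
		t.mu.Unlock()
	}
	return m
}

// pendingCount reports how many tracked messages have not been acked.
func (t *ackTracker) pendingCount() int {
	t.mu.Lock()
	defer t.mu.Unlock()
	return len(t.pending)
}

// stuck reports whether any message has been pending longer than threshold.
func (t *ackTracker) stuck(threshold time.Duration) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	for _, since := range t.pending {
		if time.Since(since) > threshold {
			return true
		}
	}
	return false
}

func main() {
	tracker := newAckTracker()
	m := tracker.track(message{ID: "msg-1", Ack: func() { fmt.Println("source-specific ack") }})
	fmt.Println("pending:", tracker.pendingCount(), "stuck:", tracker.stuck(time.Minute))
	m.Ack()
	fmt.Println("pending:", tracker.pendingCount())
}
```

With something like this, only check 1 (is the client still fetching?) would need to stay inside each source.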

istreeter · 2025-02-03T09:21:01Z

cmd/cli/cli.go

@@ -340,3 +342,18 @@ func exitWithError(err error, flushSentry bool) {
 	}
 	os.Exit(1)
 }
+
+func runHealthServer(source sourceiface.Source) {


Sharing this just for interest:

In common-streams, the health probe actually responds to all requests with an ok, not just requests to the /health endpoint.

That might be a mistake in common-streams: time will tell! I did it because it felt strange to hard-code the completely arbitrary string /health.
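For comparison, a small Go sketch of the two behaviours using only the standard library. It does not show how common-streams or this PR implement their probes; `catchAllProbe`, `pathProbe`, and `statusFor` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// okHandler writes a 200 response with a short body.
func okHandler(w http.ResponseWriter, r *http.Request) {
	fmt.Fprint(w, "ok")
}

// catchAllProbe answers every path with ok.
func catchAllProbe() http.Handler {
	return http.HandlerFunc(okHandler)
}

// pathProbe answers only the given path; a pattern without a trailing
// slash matches that exact path, so other paths fall through to 404.
func pathProbe(path string) http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc(path, okHandler)
	return mux
}

// statusFor sends an in-memory GET request to h and returns the status code.
func statusFor(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, path, nil)
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	for _, p := range []string{"/health", "/anything"} {
		fmt.Println(p, "catch-all:", statusFor(catchAllProbe(), p), "path-specific:", statusFor(pathProbe("/health"), p))
	}
}
```

The catch-all variant avoids hard-coding a path, at the cost of also answering ok to requests that were never meant as health probes.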

istreeter · 2025-02-03T09:41:48Z

pkg/source/kinesis/kinesis_source.go

-
-	log *log.Entry
+	statsReceiver    *kinsumerActivityRecorder
+	unackedMsgs      map[string]int64


Did you consider making this map[uuid.UUID]int64? It looks like the keys are always stringified UUIDs.

istreeter · 2025-02-03T10:07:32Z

pkg/source/kinesis/kinesis_source.go

 			checkpointer()
+			ks.removeUnacked(randomUUID)


How deeply do you understand what checkpointer() does? Does it block until this record is actually checkpointed to DynamoDB? Or does it return immediately, so kinsumer can checkpoint it later?

From conversations I've had with others, I think it might be the latter.

If it fails to checkpoint later, then what does kinsumer do next? Does it stop calling the EventsFromKinesis function?

Does any of this matter? Possibly not, because I think your health check endpoint works correctly anyway. But it's worth thinking about.

pondzix force-pushed the spike/health_checking branch from 5108ac9 to 77c8e17 January 29, 2025 07:09

pondzix requested a review from colmsnowplow January 29, 2025 17:12

istreeter reviewed Feb 3, 2025
