Various gRPC issues with Dagster deployed using k8s with Helm #27266
Labels: deployment: k8s, type: troubleshooting
Background
I have deployed Dagster in k8s with Helm. There are 7 code locations and an additional read-only webserver, which together with the standard webserver and daemon makes 10 persistent pods in total.
I have been encountering different gRPC-related issues, which I have tried to debug and understand. However, I have very little experience with RPC, and the Dagster documentation doesn't describe much about how it is implemented, so I feel there is still a large gap in my understanding of what is happening.
I considered adding a comment to issue #25116, but I'm not 100% sure my problems are the same.
Scenario 1
One of the code location pods, seemingly at random, loses connection to its gRPC server - the readiness check fails and keeps failing for many hours straight. After 7 hours, I manually restarted the k8s deployment, which created a new pod and killed the old one, and the issue was resolved - it could connect to the gRPC server again. This happens very intermittently - once a week or once every two weeks - but there have been cases where it happened with only a day in between.
Scenario 2
As a temporary bandaid fix for the above scenario, I thought about configuring k8s liveness probes so that a sustained readiness probe failure would trigger an automatic restart, since restarting manually seemed to help. However, when testing this in my dev k8s cluster and Dagster deployment, I observed that some code location pods now start getting OOMKilled.
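Roughly, the liveness probe I was testing, as it ends up in the code location container spec - the timing values below are placeholders, not a recommendation:

```yaml
livenessProbe:
  exec:
    # same health check the readiness probe runs
    command: ["dagster", "api", "grpc-health-check", "-p", "3010"]
  initialDelaySeconds: 60
  periodSeconds: 30
  timeoutSeconds: 15
  failureThreshold: 10   # only restart after a sustained failure
```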
After a lot of investigation, I managed to pinpoint it (and replicate it) to concurrent gRPC requests (I think).
What I did to replicate it manually: I went into the pod with `kubectl exec` and manually executed the health check that the readiness and liveness probes run: `dagster api grpc-health-check -p 3010`.
With a single execution, it usually finished in 2-8 seconds, but sometimes, when timed right, it would fail with the usual error, and then the container restarted due to being OOMKilled.
After bumping the memory requests and limits 3x, I was then able to manually run the health check command concurrently multiple times with `time dagster api grpc-health-check -p 3010 & time dagster api grpc-health-check -p 3010 & ... &`. So it seems that running these health checks in parallel causes a large memory spike that scales with how many requests run in parallel.
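For completeness, the memory bump was just the per-deployment resources block in the Helm values (assuming the user-deployments subchart's `resources` field) - the name and numbers below are illustrative, not my actual values:

```yaml
dagster-user-deployments:
  deployments:
    - name: my-code-location   # placeholder name
      port: 3010
      resources:
        requests:
          memory: 768Mi        # illustrative values
        limits:
          memory: 1536Mi
```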
Questions
Should the `max_workers` value of the code location gRPC server be changed to something smaller than the default?
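If tuning it is the recommended fix, my understanding is that it would be passed through the gRPC server args in the Helm values, roughly as below - though I'm not sure about the exact flag spelling (`--max_workers` vs `--max-workers`) across Dagster versions, so treat this as a sketch:

```yaml
dagster-user-deployments:
  deployments:
    - name: my-code-location     # placeholder name
      port: 3010
      dagsterApiGrpcArgs:
        - "--python-file"
        - "/app/repo.py"         # placeholder for the existing args
        - "--max_workers"        # or "--max-workers", depending on version
        - "4"                    # illustrative value
```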