Various gRPC issues with Dagster deployed using k8s with Helm #27266
Labels: deployment: k8s, type: troubleshooting
Background
I have deployed Dagster in k8s with Helm. There are 7 code locations and an additional read-only webserver, which together with the standard webserver and daemon makes 10 persistent pods in total.
I have been encountering different gRPC-related issues, which I have tried to debug and understand. However, I have very little experience with RPC, and the Dagster documentation doesn't describe much about how it is implemented, so I feel there is still a large gap in my understanding of what is happening.
I considered adding a comment to issue #25116, but I'm not 100% sure my problems are the same.
Scenario 1
One of the code location pods, seemingly at random, loses connection to its gRPC server - the readiness check fails and keeps failing for many hours straight. After 7 hours, I manually restarted the k8s deployment, which created a new pod and killed the old one, and the issue was resolved - it could connect to the gRPC server again. This happens very intermittently - once a week or once every two weeks - but there have been cases where it happened with only a day in between.
Scenario 2
As a temporary bandaid fix for the above scenario, I thought about configuring k8s liveness probes so that a sustained readiness probe failure would trigger an automatic restart, since restarting manually seemed to help. However, when testing this in my dev k8s cluster and Dagster deployment, I observed that some code location pods now start getting OOMKilled.
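Roughly, the liveness probe I was testing, as it ends up in the code location container spec - the timing values below are placeholders, not a recommendation:

```yaml
livenessProbe:
  exec:
    # same health check the readiness probe runs
    command: ["dagster", "api", "grpc-health-check", "-p", "3010"]
  initialDelaySeconds: 60
  periodSeconds: 30
  timeoutSeconds: 15
  failureThreshold: 10   # only restart after a sustained failure
```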
After a lot of investigation, I managed to pinpoint it (and replicate it) to concurrent gRPC requests (I think).
What I did to replicate it manually: I went into the pod with `kubectl exec` and manually executed the health check that the readiness and liveness probes run: `dagster api grpc-health-check -p 3010`.
With a single execution, it usually finished in 2-8 seconds, but sometimes, when timed right, it would fail with the usual error, and then the container restarted due to being OOMKilled.
After bumping the memory requests and limits 3x, I was then able to manually run the health check command concurrently multiple times with `time dagster api grpc-health-check -p 3010 & time dagster api grpc-health-check -p 3010 & ... &`. So it seems that running these health checks in parallel causes a large memory spike that scales with how many requests run in parallel.
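For completeness, the memory bump was just the per-deployment resources block in the Helm values (assuming the user-deployments subchart's `resources` field) - the name and numbers below are illustrative, not my actual values:

```yaml
dagster-user-deployments:
  deployments:
    - name: my-code-location   # placeholder name
      port: 3010
      resources:
        requests:
          memory: 768Mi        # illustrative values
        limits:
          memory: 1536Mi
```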
Questions
Should the `max_workers` value of the code location gRPC server be changed to something smaller than the default?
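If tuning it is the recommended fix, my understanding is that it would be passed through the gRPC server args in the Helm values, roughly as below - though I'm not sure about the exact flag spelling (`--max_workers` vs `--max-workers`) across Dagster versions, so treat this as a sketch:

```yaml
dagster-user-deployments:
  deployments:
    - name: my-code-location     # placeholder name
      port: 3010
      dagsterApiGrpcArgs:
        - "--python-file"
        - "/app/repo.py"         # placeholder for the existing args
        - "--max_workers"        # or "--max-workers", depending on version
        - "4"                    # illustrative value
```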