Various gRPC issues with Dagster deployed using k8s with Helm #27266

Open
mmutso-boku opened this issue Jan 22, 2025 · 0 comments
Labels: deployment: k8s (Related to deploying Dagster to Kubernetes), type: troubleshooting (Related to debugging and error messages)

Background

I have deployed Dagster in k8s with Helm. There are 7 code locations plus an additional read-only webserver, so 10 persistent pods in total.
I have been encountering various gRPC-related issues that I have tried to debug and understand, but I have very little experience with RPC, and the Dagster documentation does not say much about how it is implemented, so there is still a large gap in my understanding of what is happening.
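For context, the user code deployments follow the standard chart layout, roughly like this (the name, image, and module below are placeholders, not my real code locations):

```yaml
# Rough shape of the user code deployments in my values.yaml (placeholders only).
dagster-user-deployments:
  enabled: true
  deployments:
    - name: code-location-1
      image:
        repository: my-registry/code-location-1
        tag: latest
      dagsterApiGrpcArgs:
        - "-m"
        - "my_package.definitions"
      port: 3010
    # ... 6 more code locations defined the same way
```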

I considered adding a comment to issue #25116, but I'm not 100% sure my problems are the same.

Scenario 1

One of the code location pods, seemingly at random, loses connection to its gRPC server - the readiness check fails and keeps failing for many hours straight. After 7 hours, I manually restarted the k8s deployment, which created a new pod and killed the old one, and the issue was resolved - the pod could again connect to its gRPC server. This happens very intermittently - once a week or once every two weeks - but there have been cases where it happened with only a day in between.

Scenario 2

As a temporary band-aid fix for the above scenario, I thought about configuring k8s liveness probes so that a sustained readiness probe failure would trigger an automatic restart, since restarting seemed to help when done manually. However, when testing this in my dev k8s cluster and Dagster deployment, I observed that some code location pods started getting OOMKilled. The probe configuration I tried is sketched below.
After a lot of investigation, I managed to pinpoint it (and replicate it) to concurrent gRPC requests (I think).
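These are roughly the probe settings I added under one code location entry in the Helm values (the timings and thresholds are just the values I tried, not defaults or recommendations):

```yaml
# Probe settings added under one entry in dagster-user-deployments.deployments.
# Timings/thresholds are the values I tried, not defaults or recommendations.
readinessProbe:
  exec:
    command: ["dagster", "api", "grpc-health-check", "-p", "3010"]
  periodSeconds: 20
  timeoutSeconds: 10
  failureThreshold: 3
livenessProbe:
  exec:
    command: ["dagster", "api", "grpc-health-check", "-p", "3010"]
  initialDelaySeconds: 60
  periodSeconds: 60
  timeoutSeconds: 30
  failureThreshold: 5   # restart only after a sustained failure
```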

To replicate it manually, I went into the pod with kubectl exec and ran the same health check that the readiness and liveness probes execute: `dagster api grpc-health-check -p 3010`.

With a single execution, it usually finished in 2-8 seconds, but sometimes when timed right it would fail with the usual

<_InactiveRpcError of RPC that terminated with:
           status = StatusCode.UNAVAILABLE
           details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:3010: Failed to connect to remote host: connect: Connection refused (111)"
           debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:3010: Failed to connect to remote host: connect: Connection refused (111)", grpc_status:14

and then the container was restarted because it got OOMKilled.
After bumping the memory requests and limits 3x, I was able to run the health check command concurrently several more times with `time dagster api grpc-health-check -p 3010 & time dagster api grpc-health-check -p 3010 & ... &`.

So it seems that running these health checks in parallel causes a large memory spike, and the size of the spike depends on how many requests are running in parallel.

Questions

  1. Any ideas what could cause the situation in Scenario 1, where the pod is in the Running state but not Ready for a prolonged amount of time?
  2. Why does the memory spike happen when there are parallel gRPC requests? Should the `max_workers` value be changed to something smaller than the default (see the sketch after these questions for how I assume it would be set)?
  3. Is there a more detailed explanation of the whole Dagster gRPC setup? For example: are the gRPC servers on the code location pods or in the daemon? What connects to where when the health check is executed? What exactly runs in the code location pod and what runs in the daemon pod with regard to sensors and schedules? etc.
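For question 2, this is how I assume `max_workers` would be lowered for a code location - via the gRPC server args in the Helm values. The `--max-workers` flag name and the value here are my assumption, not something I have verified:

```yaml
# Under the same code location entry in dagster-user-deployments.deployments
# (the --max-workers flag and the value 4 are assumptions on my part).
dagsterApiGrpcArgs:
  - "-m"
  - "my_package.definitions"   # placeholder module
  - "--max-workers"
  - "4"
```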