An update on NGINX TCP termination and questions about scaling and high availability features for EG #949
-
Hey Kevin,

This issue will come up if you're deploying the Kubernetes version of Jupyter Enterprise Gateway to AWS EKS. Standard REST operations will succeed, but if you try to create a websocket via wss://your_gateway_url/gateway/api/kernels/${kernelId}/channels, the operation will fail. This issue is very similar. The TL;DR is that the standard NGINX configuration tells the AWS load balancer to downgrade TCP requests to HTTP within the cluster, but on EKS the requests need to stay TCP in order for the websocket upgrade to succeed. (I don't claim to fully understand this, but changing the configuration does work.) I've attached a document that may be useful; feel free to use it if it is.

On to my question about scaling. It turns out that EG has fit the project I'm working on really well. You can check it out here if you're interested: https://qbook.qbraid.com. Thank you so much for building out such an awesome tool!
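For anyone who lands here later, the relevant change is usually made on the Service that fronts the NGINX ingress controller. A minimal sketch, assuming the community ingress-nginx Helm chart and the in-tree AWS load balancer annotations (your chart layout and annotation set may differ):

```yaml
# values.yaml fragment for the ingress-nginx controller (sketch, not the exact config)
controller:
  service:
    annotations:
      # Default examples often set this to "http", which downgrades the
      # LB -> NGINX hop and breaks the websocket Connection: Upgrade handshake.
      # "tcp" keeps the hop as raw TCP so wss:// connections can be established.
      service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "tcp"
```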
-
Hi Elliot - it's good to hear from you. Thank you for the detailed information regarding the AWS LB and EKS. I'm sure this will be useful to others.

Regarding scaling, we have not spent time directly testing the various thresholds, but we have heard of installations pushing boundaries above 300 - although I'm not certain we have the details. It would be good to understand the GET and WebSocket issues a bit better. What version of EG and Notebook are you running against?

It's tough to have a discussion about HA without also involving disaster recovery (DR). We also have an experimental kernel session persistence option.

What still needs to happen to have both HA and DR is that we need to intercept "kernel-not-found" exceptions in the kernel handler and go through the motions of checking the persisted kernel sessions, such that if the desired kernel id exists, it would be loaded on that EG server and reconnected. This would allow for a more "active/active" behavior. However, in this mode, we would likely need to disable the previous DR functionality - where starting EG causes all of the persisted sessions to be reconnected. In the active/active approach, we would instead let the user trigger the reconnection by issuing requests against a remote kernel that the original EG instance cannot satisfy. Since the new EG doesn't know about that kernel, it would reconnect on the fly.

Another detail that would need to be worked out is how culling would work. Since the kernel's last activity (which is used to determine the kernel's idle period) is maintained in memory, the culling period would reset when a new EG takes over management. That isn't so bad, since the kernel would be culled eventually. However, if the user never triggers the kernel's "hydration" on the new EG instance (i.e., the kernel was destined to be culled), then it may never be culled, which could result in orphaned kernel pods/processes. One approach could be to also record the kernel's last activity in its persisted information; then, when the new EG hydrates the kernel, the persisted last activity could be used to seed the culling monitor. We could also have each EG instance look for culling candidates, to address the cases where the user's action didn't hydrate the kernel. But again, that solution would require updating the kernel's persistent state on each activity, and I'm not sure that's something we should do from a performance perspective. We've talked about this kind of stuff on #562.

I think adding the kernel-not-found code to check the persisted information would be straightforward and useful. Perhaps we could introduce some kind of DR-style option (active/active or active/passive) so the two flavors could co-exist. In active/active, EG would not load persisted sessions from the file, but would let their "miss" trigger reconnection. In active/passive, newly started EGs would load persisted sessions. Since the idea is that there would never be more than one active EG, the "miss/reconnect" code could still exist, as it should, in theory, never find a missing kernel id that is in the persisted state.

Sorry for the ramble, but I believe those are the current thoughts on where we are on the HA/DR front.
-
Starting a new subthread here about NGINX stuff and networking within Kubernetes / EG. Before you get buried in the details, I'd like to ask a few questions:
So, from my research, here's a description of the symptoms of the problem and the architectures involved. I started by investigating the NGINX LoadBalancer Service/controller. In order to propagate information about the client IP address to the backend, my understanding is that we need to enable the PROXY protocol both on the load balancer and in the NGINX ConfigMap.
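For reference, a sketch of what that looks like, assuming the community ingress-nginx Helm chart (the annotation goes on the controller's Service and the ConfigMap key comes from the ingress-nginx docs; exact values.yaml paths may differ for other charts):

```yaml
# values.yaml fragment for the ingress-nginx controller (sketch)
controller:
  service:
    annotations:
      # Ask the AWS load balancer to speak PROXY protocol to the NGINX pods,
      # which preserves the original client IP.
      service.beta.kubernetes.io/aws-load-balancer-proxy-protocol: "*"
  config:
    # Tell NGINX to expect and parse the PROXY protocol header.
    use-proxy-protocol: "true"
```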
After doing this configuration step, I verified that both the X-Real-IP and X-Forwarded-For headers were being set. You can check this by increasing the NGINX log level using this guide. The next thing I tried was changing the affinity of the NGINX ingress itself, which you can do by specifying annotations in values.yaml along the lines of the sketch below.
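A sketch, assuming cookie-based affinity annotations are passed through to the Ingress object your chart renders (the annotation names are ingress-nginx's; the cookie name and where this lives in values.yaml are assumptions):

```yaml
# values.yaml fragment: cookie affinity annotations on the Ingress (sketch)
ingress:
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "eg-route"
    nginx.ingress.kubernetes.io/session-cookie-max-age: "172800"
```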
Unfortunately, as far as I can tell, NGINX only supports cookie affinity, while Kubernetes Services only support ClientIP affinity. The last piece of configuration I tried relates to routing within the Service created by service.yaml: my intuition was that setting an affinity at the Service level should be sufficient for sticky sessions, so I also suspected that perhaps the headers weren't reaching the Service.
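For context, Service-level affinity is expressed like this (a sketch; the timeout value is arbitrary):

```yaml
# service.yaml fragment: ClientIP session affinity at the Service level (sketch)
spec:
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800
```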
But this ultimately didn't change any of the behavior I was experiencing. This isn't really surprising: because the NGINX ingress proxies the connection, the source IP the Service sees is the ingress controller pod's rather than the end user's, so ClientIP affinity can't distinguish between clients (and the X-Forwarded-For / X-Real-IP headers are HTTP-level information that Service-level affinity never consults). Here's a diagram and some light reading if you are as confused as I was about Ingress vs Load Balancers.
-
Alright, I have a partial solution to the load balancing issue.

I decided that it would be worthwhile to simplify things and strip out the L7 NGINX load balancer entirely. I simply exposed the deployment and then added my SSL cert (a sketch of the resulting Service is below).
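Roughly speaking, the exposed Service ends up looking like this (a sketch: the name, namespace, selector labels, target port, and certificate ARN are assumptions based on the default EG Helm chart, not my exact manifest):

```yaml
# Sketch: expose the Enterprise Gateway deployment directly through an AWS
# load balancer, terminating TLS at the LB with an ACM certificate.
apiVersion: v1
kind: Service
metadata:
  name: enterprise-gateway-lb
  namespace: enterprise-gateway
  annotations:
    # ACM certificate used by the load balancer (placeholder ARN)
    service.beta.kubernetes.io/aws-load-balancer-ssl-cert: "arn:aws:acm:<region>:<account>:certificate/<id>"
    # Terminate TLS on port 443 at the LB
    service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "443"
    # Keep the LB -> pod hop as plain TCP so websockets still upgrade
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "tcp"
spec:
  type: LoadBalancer
  selector:
    app: enterprise-gateway
  ports:
    - name: https
      port: 443
      targetPort: 8888
```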
This resulted in some limited success. Sessions seem to persist through the kernel start and kernel retrieval operations, but I've been seeing consistent failures on the restart operations (those no longer seem to be required in my series of kernel startup operations, though). There are some other possible solutions that I haven't explored (or not explored deeply) that may be helpful to someone stumbling across this.
Also, for anyone interested, my retry logic is implemented in JavaScript. Note that I'm using Alt to manage actions and state, but the basic business logic should be pretty obvious regardless of framework; a framework-agnostic sketch is below.
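A minimal sketch of the retry idea (not my actual Alt actions/store; the gateway URL, endpoint path, kernel name, and retry/backoff values are placeholders):

```javascript
// Sketch: retry a kernel-start request against the gateway with a simple
// linear backoff. GATEWAY_URL and the retry settings are placeholders.
const GATEWAY_URL = 'https://your_gateway_url/gateway';

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function startKernelWithRetry(kernelName, { retries = 5, delayMs = 1000 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= retries; attempt += 1) {
    try {
      const response = await fetch(`${GATEWAY_URL}/api/kernels`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ name: kernelName }),
      });
      if (!response.ok) {
        throw new Error(`Kernel start failed with HTTP ${response.status}`);
      }
      return await response.json(); // response body includes the kernel id
    } catch (err) {
      lastError = err;
      await sleep(delayMs * attempt); // back off a little more each attempt
    }
  }
  throw lastError;
}

// usage:
// startKernelWithRetry('python_kubernetes').then((kernel) => console.log(kernel.id));
```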