An update on NGINX TCP termination and questions about scaling and high availability features for EG #949
-
Hey Kevin,

This issue will come up if you're deploying the Kubernetes version of Jupyter Enterprise Gateway to AWS EKS. Standard REST operations will succeed, but if you try to create a websocket via wss://your_gateway_url/gateway/api/kernels/${kernelId}/channels, the operation will fail. This issue is very similar. The TL;DR is that the standard NGINX configuration tells the AWS load balancer to downgrade TCP requests to HTTP within the cluster, but on EKS the requests need to stay TCP in order for the websocket upgrade to succeed. (I don't claim to fully understand this, but changing the configuration does work.) I've attached a document that may be useful; feel free to use it if it is.

On to my question about scaling. It turns out that EG has fit the project I'm working on really well. You can check it out here if you're interested: https://qbook.qbraid.com. Thank you so much for building out such an awesome tool!
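For anyone who lands here later, the relevant change is usually made on the Service that fronts the NGINX ingress controller. A minimal sketch, assuming the community ingress-nginx Helm chart and the in-tree AWS load balancer annotations (your chart layout and annotation set may differ):

```yaml
# values.yaml fragment for the ingress-nginx controller (sketch, not the exact config)
controller:
  service:
    annotations:
      # Default examples often set this to "http", which downgrades the
      # LB -> NGINX hop and breaks the websocket Connection: Upgrade handshake.
      # "tcp" keeps the hop as raw TCP so wss:// connections can be established.
      service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "tcp"
```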
-
Hi Elliot - it's good to hear from you. Thank you for the detailed information regarding the AWS LB and EKS. I'm sure this will be useful to others.

Regarding scaling, we have not spent time directly testing the various thresholds, but we have heard of installations pushing boundaries above 300 - although I'm not certain we have the details. It would be good to understand the GET and WebSocket issues a bit better. What version of EG and Notebook are you running against?

It's tough to have a discussion about HA without also involving disaster recovery (DR). We also have an experimental kernel session persistence option.

What still needs to happen to have both HA and DR is that we need to intercept "kernel-not-found" exceptions in the kernel handler and go through the motions of checking the persisted kernel sessions, such that if the desired kernel id exists, it would be loaded on that EG server and reconnected. This would allow for a more "active/active" behavior. However, in this mode, we would likely need to disable the previous DR functionality - where starting EG causes all of the persisted sessions to be reconnected. In the active/active approach, we would instead let the user trigger the reconnection by issuing requests against a remote kernel that the original EG instance cannot satisfy. Since the new EG doesn't know about that kernel, it would reconnect on the fly.

Another detail that would need to be worked out is how culling would work. Since the kernel's last activity (which is used to determine the kernel's idle period) is maintained in memory, the culling period would reset when a new EG takes over management. That isn't so bad, since the kernel would be culled eventually. However, if the user never triggers the kernel's "hydration" on the new EG instance (i.e., the kernel was destined to be culled), then it may never be culled, which could result in orphaned kernel pods/processes. One approach could be to also record the kernel's last activity in its persisted information; then, when the new EG hydrates the kernel, the persisted last activity could be used to seed the culling monitor. We could also have each EG instance look for culling candidates, to address the cases where the user's action didn't hydrate the kernel. But again, that solution would require updating the kernel's persistent state on each activity, and I'm not sure that's something we should do from a performance perspective. We've talked about this kind of stuff on #562.

I think adding the kernel-not-found code to check the persisted information would be straightforward and useful. Perhaps we could introduce some kind of DR-style option (active/active or active/passive) so the two flavors could co-exist. In active/active, EG would not load persisted sessions from the file, but would let their "miss" trigger reconnection. In active/passive, newly started EGs would load persisted sessions. Since the idea is that there would never be more than one active EG, the "miss/reconnect" code could still exist, as it should, in theory, never find a missing kernel id that is in the persisted state.

Sorry for the ramble, but I believe those are the current thoughts on where we are on the HA/DR front.
-
Starting a new subthread here about NGINX stuff and networking within Kubernetes / EG. Before you get buried in the details, I'd like to ask a few questions:
So, from my research, here's a description of the symptoms of the problem and the architectures involved. I started by investigating the NGINX LoadBalancer Service/controller. In order to propagate information about the client IP address to the backend, my understanding is that we need to enable the PROXY protocol both on the load balancer and in the NGINX ConfigMap.
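For reference, a sketch of what that looks like, assuming the community ingress-nginx Helm chart (the annotation goes on the controller's Service and the ConfigMap key comes from the ingress-nginx docs; exact values.yaml paths may differ for other charts):

```yaml
# values.yaml fragment for the ingress-nginx controller (sketch)
controller:
  service:
    annotations:
      # Ask the AWS load balancer to speak PROXY protocol to the NGINX pods,
      # which preserves the original client IP.
      service.beta.kubernetes.io/aws-load-balancer-proxy-protocol: "*"
  config:
    # Tell NGINX to expect and parse the PROXY protocol header.
    use-proxy-protocol: "true"
```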
After doing this configuration step, I verified that both the X-Real-IP and X-Forwarded-For headers were being set. You can check this by increasing the NGINX log level using this guide. The next thing I tried was changing the affinity of the NGINX ingress itself, which you can do by specifying annotations in values.yaml along the lines of the sketch below.
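A sketch, assuming cookie-based affinity annotations are passed through to the Ingress object your chart renders (the annotation names are ingress-nginx's; the cookie name and where this lives in values.yaml are assumptions):

```yaml
# values.yaml fragment: cookie affinity annotations on the Ingress (sketch)
ingress:
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "eg-route"
    nginx.ingress.kubernetes.io/session-cookie-max-age: "172800"
```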
Unfortunately, as far as I can tell, NGINX only supports cookie affinity, while Kubernetes Services only support ClientIP affinity. The last piece of configuration I tried relates to routing within the Service created by service.yaml: my intuition was that setting an affinity at the Service level should be sufficient for sticky sessions, so I also suspected that perhaps the headers weren't reaching the Service.
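For context, Service-level affinity is expressed like this (a sketch; the timeout value is arbitrary):

```yaml
# service.yaml fragment: ClientIP session affinity at the Service level (sketch)
spec:
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800
```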
But this ultimately didn't change any of the behavior I was experiencing. This isn't really surprising: because the NGINX ingress proxies the connection, the source IP the Service sees is the ingress controller pod's rather than the end user's, so ClientIP affinity can't distinguish between clients (and the X-Forwarded-For / X-Real-IP headers are HTTP-level information that Service-level affinity never consults). Here's a diagram and some light reading if you are as confused as I was about Ingress vs Load Balancers.
-
Alright, I have a partial solution to the load balancing issue.

I decided that it would be worthwhile to simplify things and strip out the L7 NGINX load balancer entirely. I simply exposed the deployment and then added my SSL cert (a sketch of the resulting Service is below).
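Roughly speaking, the exposed Service ends up looking like this (a sketch: the name, namespace, selector labels, target port, and certificate ARN are assumptions based on the default EG Helm chart, not my exact manifest):

```yaml
# Sketch: expose the Enterprise Gateway deployment directly through an AWS
# load balancer, terminating TLS at the LB with an ACM certificate.
apiVersion: v1
kind: Service
metadata:
  name: enterprise-gateway-lb
  namespace: enterprise-gateway
  annotations:
    # ACM certificate used by the load balancer (placeholder ARN)
    service.beta.kubernetes.io/aws-load-balancer-ssl-cert: "arn:aws:acm:<region>:<account>:certificate/<id>"
    # Terminate TLS on port 443 at the LB
    service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "443"
    # Keep the LB -> pod hop as plain TCP so websockets still upgrade
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "tcp"
spec:
  type: LoadBalancer
  selector:
    app: enterprise-gateway
  ports:
    - name: https
      port: 443
      targetPort: 8888
```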
This resulted in some limited success. Sessions seem to persist through the kernel start and kernel retrieval operations, but I've been seeing consistent failures on the restart operations (those no longer seem to be required in my series of kernel startup operations, though). There are some other possible solutions that I haven't explored (or not explored deeply) that may be helpful to someone stumbling across this.
Also, for anyone interested, my retry logic is implemented in JavaScript. Note that I'm using Alt to manage actions and state, but the basic business logic should be pretty obvious regardless of framework; a framework-agnostic sketch is below.
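A minimal sketch of the retry idea (not my actual Alt actions/store; the gateway URL, endpoint path, kernel name, and retry/backoff values are placeholders):

```javascript
// Sketch: retry a kernel-start request against the gateway with a simple
// linear backoff. GATEWAY_URL and the retry settings are placeholders.
const GATEWAY_URL = 'https://your_gateway_url/gateway';

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function startKernelWithRetry(kernelName, { retries = 5, delayMs = 1000 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= retries; attempt += 1) {
    try {
      const response = await fetch(`${GATEWAY_URL}/api/kernels`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ name: kernelName }),
      });
      if (!response.ok) {
        throw new Error(`Kernel start failed with HTTP ${response.status}`);
      }
      return await response.json(); // response body includes the kernel id
    } catch (err) {
      lastError = err;
      await sleep(delayMs * attempt); // back off a little more each attempt
    }
  }
  throw lastError;
}

// usage:
// startKernelWithRetry('python_kubernetes').then((kernel) => console.log(kernel.id));
```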