
[FEATURE] EKS NodeGroups scalability #197

Open
srinivasreddych opened this issue Jun 24, 2024 · 5 comments

@srinivasreddych (Contributor)

Is your feature request related to a problem? Please describe.
Related to modules/compute/eks

Describe the solution you'd like
The current manifests deploy EKS managed node groups with a desired count of at least 1. Test whether the workloads can scale up from a starting capacity of 0, so customers can save money on idle instances.
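A rough sketch of the intent (the field names below are hypothetical, not the eks module's actual manifest schema): the node group declares zero starting capacity and lets Cluster Autoscaler add nodes only when pods need them.

```yaml
# Hypothetical node group entry in an eks module deployment manifest.
# Field names are illustrative only; the module's real schema may differ.
eks_nodegroups:
  - name: gpu-ng
    instance_types: ["g4dn.xlarge"]
    desired_size: 0   # start with no instances
    min_size: 0       # allow scale-in back to zero
    max_size: 4       # ceiling for Cluster Autoscaler scale-out
    labels:
      usage: gpu
```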

@srinivasreddych srinivasreddych added the enhancement New feature or request label Jun 24, 2024
@srinivasreddych (Contributor, Author)

@a13zen Feel free to add more context about the ask here

@a13zen (Contributor) commented Jun 24, 2024

Testing with the desired/minimum capacity set to 0, the ASG terminates its nodes but then brings up 2 nodes again. This could be due to the base workload deployed by the EKS module.

@srinivasreddych (Contributor, Author)

Hey @a13zen, I was able to test the workflow and here is an update:

  • When a GPU node group (NG), for example, is requested via the eks module manifest, the user is expected to declare labels on it, for example usage: gpu. The EKS module will add those labels to the NG and also add them as tags on the underlying Auto Scaling group, as described here, which is what Cluster Autoscaler (CA) requires to scale from zero:

k8s.io/cluster-autoscaler/node-template/label/usage: gpu

Expectation: when a user launches a GPU pod/job (in this context), CA will match it against the tags on the GPU NG and scale the NG out appropriately, so the GPU pod/job gets scheduled. When a node in the GPU NG comes up, it is expected behavior that aws-cni, kube-proxy, and nvidia-device-plugin pods will be launched on it. Once the GPU pod/job has finished, the EC2 instance will be terminated by CA.
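For instance, a GPU pod along these lines (an illustrative spec, not something shipped by the module) carries the usage: gpu node selector that CA matches against the node-template tag above to scale the NG up from zero:

```yaml
# Illustrative GPU pod; the nodeSelector matches the label declared on the GPU NG,
# which CA reads from the ASG's node-template tags while the NG has zero nodes.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  nodeSelector:
    usage: gpu
  containers:
    - name: cuda
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # served by nvidia-device-plugin once the node is up
```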

Having said the above, I am thinking of a design where I would refactor the EKS module to always launch a system NG with the m5.large instance type to accommodate drivers, system pods, etc., and let the user declare the remaining NGs as per their requirements. Thoughts?
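To make that concrete, the refactored layout could look roughly like the sketch below (again, hypothetical field names, not the module's actual schema): a small always-on system NG for add-ons and drivers, plus user-declared NGs that start at zero.

```yaml
# Hypothetical sketch of the proposed layout; field names are illustrative only.
eks_nodegroups:
  - name: system-ng              # always created by the EKS module
    instance_types: ["m5.large"]
    desired_size: 1
    min_size: 1
    max_size: 2
  - name: gpu-ng                 # declared by the user, scales from zero via CA
    instance_types: ["g4dn.xlarge"]
    desired_size: 0
    min_size: 0
    max_size: 4
    labels:
      usage: gpu
```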

@srinivasreddych srinivasreddych self-assigned this Jul 11, 2024
@a13zen (Contributor) commented Jul 17, 2024

Yes, having a simple system NG with small instances could be a good middle ground for sure. Do we know if m5.large would be sufficient for the default services deployed by the EKS module?

@srinivasreddych (Contributor, Author)

From my understanding, m5.large (2 vCPU, 8 GiB RAM) should be sufficient, but depending on the number of plugins/drivers we or the user deploy, the instance count should be greater than 1. So starting with an instance count of 2 should be a safe bet. Thoughts?
