
[FEATURE] EKS NodeGroups scalability #197

Open
srinivasreddych opened this issue Jun 24, 2024 · 5 comments

@srinivasreddych (Contributor)

Is your feature request related to a problem? Please describe.
Related to modules/compute/eks

Describe the solution you'd like
The current manifests deploy EKS managed node groups with a desired count of at least 1. Test whether the workloads can scale up from a starting capacity of 0, so customers can save money on idle instances.
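A rough sketch of the intent (the field names below are hypothetical, not the eks module's actual manifest schema): the node group declares zero starting capacity and lets Cluster Autoscaler add nodes only when pods need them.

```yaml
# Hypothetical node group entry in an eks module deployment manifest.
# Field names are illustrative only; the module's real schema may differ.
eks_nodegroups:
  - name: gpu-ng
    instance_types: ["g4dn.xlarge"]
    desired_size: 0   # start with no instances
    min_size: 0       # allow scale-in back to zero
    max_size: 4       # ceiling for Cluster Autoscaler scale-out
    labels:
      usage: gpu
```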

@srinivasreddych srinivasreddych added the enhancement New feature or request label Jun 24, 2024
@srinivasreddych (Contributor, Author)

@a13zen Feel free to add more context about the ask here

@a13zen (Contributor) commented Jun 24, 2024

Testing with the desired/minimum capacity set to 0, the ASG terminates its nodes but then brings up 2 nodes again. This could be due to the base workload deployed by the EKS module.

@srinivasreddych (Contributor, Author)

Hey @a13zen, I was able to test the workflow and here is an update:

  • When a GPU node group (NG), for example, is requested via the eks module manifest, the user is expected to declare labels on it, for example usage: gpu. The EKS module will add those labels to the NG and also add them as tags on the underlying Auto Scaling group, as described here, which is what Cluster Autoscaler (CA) requires to scale from zero:

k8s.io/cluster-autoscaler/node-template/label/usage: gpu

Expectation: when a user launches a GPU pod/job (in this context), CA will match it against the tags on the GPU NG and scale the NG out appropriately, so the GPU pod/job gets scheduled. When a node in the GPU NG comes up, it is expected behavior that aws-cni, kube-proxy, and nvidia-device-plugin pods will be launched on it. Once the GPU pod/job has finished, the EC2 instance will be terminated by CA.
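For instance, a GPU pod along these lines (an illustrative spec, not something shipped by the module) carries the usage: gpu node selector that CA matches against the node-template tag above to scale the NG up from zero:

```yaml
# Illustrative GPU pod; the nodeSelector matches the label declared on the GPU NG,
# which CA reads from the ASG's node-template tags while the NG has zero nodes.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  nodeSelector:
    usage: gpu
  containers:
    - name: cuda
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # served by nvidia-device-plugin once the node is up
```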

Having said the above, I am thinking of a design where I would refactor the EKS module to always launch a system NG with the m5.large instance type to accommodate drivers, system pods, etc., and let the user declare the remaining NGs as per their requirements. Thoughts?
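To make that concrete, the refactored layout could look roughly like the sketch below (again, hypothetical field names, not the module's actual schema): a small always-on system NG for add-ons and drivers, plus user-declared NGs that start at zero.

```yaml
# Hypothetical sketch of the proposed layout; field names are illustrative only.
eks_nodegroups:
  - name: system-ng              # always created by the EKS module
    instance_types: ["m5.large"]
    desired_size: 1
    min_size: 1
    max_size: 2
  - name: gpu-ng                 # declared by the user, scales from zero via CA
    instance_types: ["g4dn.xlarge"]
    desired_size: 0
    min_size: 0
    max_size: 4
    labels:
      usage: gpu
```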

@srinivasreddych srinivasreddych self-assigned this Jul 11, 2024
@a13zen (Contributor) commented Jul 17, 2024

Yes, having a simple system NG with small instances could be a good middle ground for sure. Do we know if m5.large would be sufficient for the default services deployed by the EKS module?

@srinivasreddych (Contributor, Author)

From my understanding, m5.large (2 vCPU, 8 GiB RAM) should be sufficient, but depending on the number of plugins/drivers we or the user deploy, the instance count should be greater than 1. So starting with an instance count of 2 should be a safe bet. Thoughts?
