
Applying dynamic taint and nvidia daemonset #64

Closed
wants to merge 9 commits

Conversation

@mclacore (Contributor) commented May 2, 2024

Changes in this PR:

  • Added NVIDIA daemonset
  • Renamed node_group_name to node_group_name_prefix to avoid node group name collisions (confirmed this is a backwards-compatible change)
  • Added data lookups for Amazon Linux AMIs (GPU and non-GPU)
  • Added a regex matching the various GPU-enabled instance types
  • Removed the version field from aws_eks_node_group, since looking up the AMI and setting it in the launch template (as image_id) now pins the Kubernetes version for the node group
  • Added the image_id setting to the launch template
  • Added dynamic taint application for GPU node groups (see the sketch after this list)
  • Added a bootstrap.sh script call that the AMIs depend on to join the EKS cluster
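
A rough sketch of how the GPU detection and dynamic taint fit together, for reviewers skimming the diff. It is illustrative only: the regex, the taint key/value, and helpers such as local.private_subnet_ids follow the shapes discussed in this PR but are not copied from it.

locals {
  # Illustrative pattern for GPU-enabled instance families; the module defines its own regex.
  gpu_instance_regex = "^(p[2-5]|g[3-6]|g5g)\\."
}

resource "aws_eks_node_group" "node_group" {
  for_each = { for ng in var.node_groups : ng.name_suffix => ng }

  cluster_name           = local.cluster_name
  node_group_name_prefix = "${local.cluster_name}-${each.value.name_suffix}"
  node_role_arn          = aws_iam_role.node.arn
  subnet_ids             = local.private_subnet_ids
  instance_types         = [each.value.instance_type]

  scaling_config {
    min_size     = each.value.min_size
    max_size     = each.value.max_size
    desired_size = each.value.min_size
  }

  # Taint the group only when its instance type matches the GPU regex.
  # The key/value here are placeholders; see the review thread below about
  # standardizing on nvidia.com/gpu.
  dynamic "taint" {
    for_each = can(regex(local.gpu_instance_regex, each.value.instance_type)) ? [1] : []
    content {
      key    = "sku"
      value  = "gpu"
      effect = "NO_SCHEDULE"
    }
  }
}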

Tests run against changes:

  • Fresh cluster deploy with 1 non-GPU node group and 1 GPU node group ✅
  • Fresh cluster deploy with 1 GPU node group only ❌
  • Upgrade in place of existing cluster (1 non-GPU node group) ✅
  • Upgrade in place of existing cluster (2 non-GPU node groups) ✅
  • Upgrade in place of existing cluster + adding GPU node group ✅
  • Decommissioning cluster without upgrade ✅

TODO:

  • Test removing the SSM identity from the GPU nodes and retest the GPU workload
  • Remove the CUSTOM AMI type
  • Find out why a GPU node group can't deploy standalone due to the EBS CSI driver failure

linear bot commented May 2, 2024

@mclacore (Contributor, Author) commented:

Confirmed the nodes are getting bounced during upgrade:

➜  aws-eks-cluster git:(main) ✗ k get no
NAME                          STATUS   ROLES    AGE   VERSION
ip-10-0-123-28.ec2.internal   Ready    <none>   15m   v1.27.12-eks-ae9a62a
ip-10-0-85-147.ec2.internal   Ready    <none>   19m   v1.27.12-eks-ae9a62a
➜  aws-eks-cluster git:(main) ✗ k get no -w
NAME                          STATUS   ROLES    AGE   VERSION
ip-10-0-123-28.ec2.internal   Ready    <none>   16m   v1.27.12-eks-ae9a62a
ip-10-0-85-147.ec2.internal   Ready    <none>   19m   v1.27.12-eks-ae9a62a
ip-10-0-123-28.ec2.internal   Ready    <none>   16m   v1.27.12-eks-ae9a62a
ip-10-0-89-222.ec2.internal   NotReady   <none>   0s    v1.27.12-eks-ae9a62a
ip-10-0-89-222.ec2.internal   NotReady   <none>   0s    v1.27.12-eks-ae9a62a
ip-10-0-89-222.ec2.internal   NotReady   <none>   0s    v1.27.12-eks-ae9a62a
ip-10-0-89-222.ec2.internal   NotReady   <none>   0s    v1.27.12-eks-ae9a62a
ip-10-0-89-222.ec2.internal   NotReady   <none>   0s    v1.27.12-eks-ae9a62a
ip-10-0-89-222.ec2.internal   NotReady   <none>   0s    v1.27.12-eks-ae9a62a
ip-10-0-89-222.ec2.internal   NotReady   <none>   1s    v1.27.12-eks-ae9a62a
ip-10-0-89-222.ec2.internal   NotReady   <none>   10s   v1.27.12-eks-ae9a62a
ip-10-0-89-222.ec2.internal   Ready      <none>   11s   v1.27.12-eks-ae9a62a
ip-10-0-89-222.ec2.internal   Ready      <none>   11s   v1.27.12-eks-ae9a62a
ip-10-0-89-222.ec2.internal   Ready      <none>   16s   v1.27.12-eks-ae9a62a
ip-10-0-89-222.ec2.internal   Ready      <none>   23s   v1.27.12-eks-ae9a62a
ip-10-0-89-222.ec2.internal   Ready      <none>   30s   v1.27.12-eks-ae9a62a
ip-10-0-89-222.ec2.internal   Ready      <none>   31s   v1.27.12-eks-ae9a62a
ip-10-0-89-222.ec2.internal   Ready      <none>   61s   v1.27.12-eks-ae9a62a
ip-10-0-85-147.ec2.internal   Ready      <none>   23m   v1.27.12-eks-ae9a62a
ip-10-0-123-28.ec2.internal   Ready      <none>   19m   v1.27.12-eks-ae9a62a
ip-10-0-85-147.ec2.internal   Ready,SchedulingDisabled   <none>   23m   v1.27.12-eks-ae9a62a
ip-10-0-85-147.ec2.internal   Ready,SchedulingDisabled   <none>   23m   v1.27.12-eks-ae9a62a
ip-10-0-123-28.ec2.internal   Ready,SchedulingDisabled   <none>   19m   v1.27.12-eks-ae9a62a
ip-10-0-123-28.ec2.internal   Ready,SchedulingDisabled   <none>   19m   v1.27.12-eks-ae9a62a
ip-10-0-85-147.ec2.internal   Ready,SchedulingDisabled   <none>   23m   v1.27.12-eks-ae9a62a
ip-10-0-89-222.ec2.internal   Ready                      <none>   92s   v1.27.12-eks-ae9a62a
ip-10-0-89-222.ec2.internal   Ready                      <none>   2m3s   v1.27.12-eks-ae9a62a
ip-10-0-123-28.ec2.internal   NotReady,SchedulingDisabled   <none>   21m    v1.27.12-eks-ae9a62a
ip-10-0-123-28.ec2.internal   NotReady,SchedulingDisabled   <none>   21m    v1.27.12-eks-ae9a62a
ip-10-0-56-188.ec2.internal   NotReady                      <none>   0s     v1.27.12-eks-ae9a62a
ip-10-0-56-188.ec2.internal   NotReady                      <none>   0s     v1.27.12-eks-ae9a62a
ip-10-0-56-188.ec2.internal   NotReady                      <none>   0s     v1.27.12-eks-ae9a62a
ip-10-0-56-188.ec2.internal   NotReady                      <none>   0s     v1.27.12-eks-ae9a62a
ip-10-0-56-188.ec2.internal   NotReady                      <none>   0s     v1.27.12-eks-ae9a62a
ip-10-0-56-188.ec2.internal   NotReady                      <none>   0s     v1.27.12-eks-ae9a62a
ip-10-0-56-188.ec2.internal   NotReady                      <none>   2s     v1.27.12-eks-ae9a62a
ip-10-0-123-28.ec2.internal   NotReady,SchedulingDisabled   <none>   21m    v1.27.12-eks-ae9a62a
ip-10-0-56-188.ec2.internal   NotReady                      <none>   3s     v1.27.12-eks-ae9a62a
ip-10-0-56-188.ec2.internal   NotReady                      <none>   10s    v1.27.12-eks-ae9a62a
ip-10-0-56-188.ec2.internal   Ready                         <none>   12s    v1.27.12-eks-ae9a62a
ip-10-0-56-188.ec2.internal   Ready                         <none>   12s    v1.27.12-eks-ae9a62a
ip-10-0-56-188.ec2.internal   Ready                         <none>   13s    v1.27.12-eks-ae9a62a
ip-10-0-123-28.ec2.internal   NotReady,SchedulingDisabled   <none>   21m    v1.27.12-eks-ae9a62a
ip-10-0-56-188.ec2.internal   Ready                         <none>   28s    v1.27.12-eks-ae9a62a
ip-10-0-56-188.ec2.internal   Ready                         <none>   30s    v1.27.12-eks-ae9a62a
ip-10-0-56-188.ec2.internal   Ready                         <none>   61s    v1.27.12-eks-ae9a62a
ip-10-0-85-147.ec2.internal   NotReady,SchedulingDisabled   <none>   26m    v1.27.12-eks-ae9a62a
ip-10-0-85-147.ec2.internal   NotReady,SchedulingDisabled   <none>   26m    v1.27.12-eks-ae9a62a
ip-10-0-85-147.ec2.internal   NotReady,SchedulingDisabled   <none>   26m    v1.27.12-eks-ae9a62a
ip-10-0-85-147.ec2.internal   NotReady,SchedulingDisabled   <none>   27m    v1.27.12-eks-ae9a62a
ip-10-0-89-222.ec2.internal   Ready                         <none>   7m9s   v1.27.12-eks-ae9a62a
^C%
➜  aws-eks-cluster git:(michael/#ORC-515/add-gpu-support) ✗ k get no
NAME                          STATUS   ROLES    AGE     VERSION
ip-10-0-56-188.ec2.internal   Ready    <none>   4m32s   v1.27.12-eks-ae9a62a
ip-10-0-89-222.ec2.internal   Ready    <none>   7m25s   v1.27.12-eks-ae9a62a

@mclacore (Contributor, Author) commented:

The EBS CSI driver times out when deploying to a standalone GPU node group because of a taint toleration issue with the EBS CSI pods. Not worth investing time in it: as long as a compute or other type of node group is deployed alongside the GPU node group, it works.

@mclacore (Contributor, Author) commented:

Confirmed that both the main branch AMI and the new branch AMI (manually declaring the newest AMI) resolve to the same image ID.
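
For reference, one standard way to do the GPU vs non-GPU AMI lookup is via the SSM parameters AWS publishes for the EKS-optimized Amazon Linux 2 images; whether the lookup here uses aws_ami filters or SSM, both should resolve to the same recommended image IDs. A minimal sketch (var.k8s_version and local.is_gpu are assumed names):

data "aws_ssm_parameter" "eks_ami" {
  # EKS-optimized Amazon Linux 2 AMI for the cluster's Kubernetes version
  name = "/aws/service/eks/optimized-ami/${var.k8s_version}/amazon-linux-2/recommended/image_id"
}

data "aws_ssm_parameter" "eks_gpu_ami" {
  # GPU variant, which ships with the NVIDIA drivers preinstalled
  name = "/aws/service/eks/optimized-ami/${var.k8s_version}/amazon-linux-2-gpu/recommended/image_id"
}

locals {
  # Pick the image per node group; is_gpu stands in for the GPU instance-type check.
  node_image_id = local.is_gpu ? data.aws_ssm_parameter.eks_gpu_ami.value : data.aws_ssm_parameter.eks_ami.value
}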

@mclacore marked this pull request as ready for review May 29, 2024 18:31
@mclacore requested a review from chrisghill May 29, 2024 18:31

@chrisghill (Contributor) left a comment:

Some questions before approving.

massdriver.yaml (review thread resolved)
key = "CriticalAddonsOnly"
operator = "Exists"
}
toleration {

@chrisghill (Contributor) commented:

Why is this toleration needed? We aren't applying it, are we?

@chrisghill (Contributor) commented:

I think they are just using that as the taint in place of the gpu=true taint we added. We probably don't need both. We can use theirs instead.

@mclacore (Contributor, Author) commented:

To confirm: remove the toleration for sku=gpu:NoSchedule and update the dynamic taint on the node group to nvidia.com/gpu:NoSchedule?
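
Something like this, roughly (sketch only: is_gpu stands in for whatever GPU check the node group module ends up using; the upstream NVIDIA device plugin manifest already tolerates nvidia.com/gpu:NoSchedule, so the daemonset side shouldn't need the sku=gpu toleration):

# Fragment of the aws_eks_node_group resource:
dynamic "taint" {
  for_each = local.is_gpu ? [1] : []
  content {
    key    = "nvidia.com/gpu"
    value  = "true"
    effect = "NO_SCHEDULE"
  }
}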

core-services/nvidia_gpu.tf (review thread resolved)
massdriver.yaml (review thread resolved)
node_role_arn = aws_iam_role.node.arn
instance_types = [each.value.instance_type]
for_each = { for ng in var.node_groups : ng.name_suffix => ng }
node_group_name_prefix = "${local.cluster_name}-${each.value.name_suffix}"

@chrisghill (Contributor) commented:

This change is probably what is causing the recreation of all the nodes. Why are we switching from node_group_name to node_group_name_prefix?

@mclacore (Contributor, Author) commented:

Because of node group name collisions. Prior to this change, when updating the launch template, AMI, etc., I would receive an error saying the node group name already existed. By using a prefix instead, the same base name can be reused, and each group still gets a unique name because a suffix is appended.
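
A short sketch of the mechanics (the lifecycle block is an illustration of why unique names matter during replacement, not necessarily something this PR adds):

# Fragment of the aws_eks_node_group resource:

# Before: a fixed name collides when Terraform has to replace the group.
#   node_group_name = "${local.cluster_name}-${each.value.name_suffix}"

# After: the provider appends a generated suffix, so a replacement group gets
# a unique name and can briefly coexist with the group it replaces.
node_group_name_prefix = "${local.cluster_name}-${each.value.name_suffix}"

lifecycle {
  create_before_destroy = true
}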

src/main.tf (review thread outdated, resolved)
<<EOF
#!/bin/bash
set -o xtrace
/etc/eks/bootstrap.sh ${local.cluster_name} --kubelet-extra-args '--node-labels=node.kubernetes.io/instancegroup=${each.key}'

@chrisghill (Contributor) commented:

Is anything else needed in this file? Did you check to see what this file looked like on a default node before this change?

@mclacore (Contributor, Author) commented:

I did look and it's a massive file. I'll paste the contents of it in here after deploying a main branch cluster.
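
In the meantime, for orientation only: the core of the node user data is the bootstrap.sh call, and in a launch template it has to be base64 encoded. A minimal sketch of that wiring, not the actual file contents (the AMI data source is the assumed SSM lookup sketched earlier):

resource "aws_launch_template" "nodes" {
  for_each = { for ng in var.node_groups : ng.name_suffix => ng }

  name_prefix = "${local.cluster_name}-${each.value.name_suffix}-"
  image_id    = data.aws_ssm_parameter.eks_ami.value

  # Launch template user data must be base64 encoded. With a custom image_id,
  # EKS does not inject its own bootstrap, so the template calls bootstrap.sh itself.
  user_data = base64encode(<<-EOF
    #!/bin/bash
    set -o xtrace
    /etc/eks/bootstrap.sh ${local.cluster_name} --kubelet-extra-args '--node-labels=node.kubernetes.io/instancegroup=${each.key}'
  EOF
  )
}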

@chrisghill (Contributor) commented:

This change is substantial enough that we should shelve it for now, until we have marketplace/private registries implemented. We don't want to roll this out to all production EKS clusters right now.

@mclacore closed this Jun 5, 2024