
Applying dynamic taint and nvidia daemonset #64

Closed
wants to merge 9 commits

Conversation

@mclacore (Contributor) commented May 2, 2024

Changes in this PR:

  • Added NVIDIA daemonset
  • Renamed node_group_name to node_group_name_prefix to avoid node group name collisions (confirmed this is a backwards-compatible change)
  • Added data lookups for Amazon Linux AMIs (GPU and non-GPU)
  • Added a regex matching the various GPU-enabled instance types
  • Removed the version field from aws_eks_node_group, since looking up the AMI and setting it in the launch template (as image_id) now pins the Kubernetes version for the node group
  • Added the image_id setting to the launch template
  • Added dynamic taint application for GPU node groups (see the sketch after this list)
  • Added a bootstrap.sh script call that the AMIs depend on to join the EKS cluster
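
A rough sketch of how the GPU detection and dynamic taint fit together, for reviewers skimming the diff. It is illustrative only: the regex, the taint key/value, and helpers such as local.private_subnet_ids follow the shapes discussed in this PR but are not copied from it.

locals {
  # Illustrative pattern for GPU-enabled instance families; the module defines its own regex.
  gpu_instance_regex = "^(p[2-5]|g[3-6]|g5g)\\."
}

resource "aws_eks_node_group" "node_group" {
  for_each = { for ng in var.node_groups : ng.name_suffix => ng }

  cluster_name           = local.cluster_name
  node_group_name_prefix = "${local.cluster_name}-${each.value.name_suffix}"
  node_role_arn          = aws_iam_role.node.arn
  subnet_ids             = local.private_subnet_ids
  instance_types         = [each.value.instance_type]

  scaling_config {
    min_size     = each.value.min_size
    max_size     = each.value.max_size
    desired_size = each.value.min_size
  }

  # Taint the group only when its instance type matches the GPU regex.
  # The key/value here are placeholders; see the review thread below about
  # standardizing on nvidia.com/gpu.
  dynamic "taint" {
    for_each = can(regex(local.gpu_instance_regex, each.value.instance_type)) ? [1] : []
    content {
      key    = "sku"
      value  = "gpu"
      effect = "NO_SCHEDULE"
    }
  }
}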

Tests run against changes:

  • Fresh cluster deploy with 1 non-GPU node group and 1 GPU node group ✅
  • Fresh cluster deploy with 1 GPU node group only ❌
  • Upgrade in place of existing cluster (1 non-GPU node group) ✅
  • Upgrade in place of existing cluster (2 non-GPU node groups) ✅
  • Upgrade in place of existing cluster + adding GPU node group ✅
  • Decommissioning cluster without upgrade ✅

TODO:

  • Test removing the SSM identity from the GPU nodes and retest the GPU workload
  • Remove the CUSTOM AMI type
  • Find out why a GPU node group can't deploy standalone due to the EBS CSI driver failure

linear bot commented May 2, 2024

@mclacore (Contributor, Author) commented:

Confirmed the nodes are getting bounced during upgrade:

➜  aws-eks-cluster git:(main) ✗ k get no
NAME                          STATUS   ROLES    AGE   VERSION
ip-10-0-123-28.ec2.internal   Ready    <none>   15m   v1.27.12-eks-ae9a62a
ip-10-0-85-147.ec2.internal   Ready    <none>   19m   v1.27.12-eks-ae9a62a
➜  aws-eks-cluster git:(main) ✗ k get no -w
NAME                          STATUS   ROLES    AGE   VERSION
ip-10-0-123-28.ec2.internal   Ready    <none>   16m   v1.27.12-eks-ae9a62a
ip-10-0-85-147.ec2.internal   Ready    <none>   19m   v1.27.12-eks-ae9a62a
ip-10-0-123-28.ec2.internal   Ready    <none>   16m   v1.27.12-eks-ae9a62a
ip-10-0-89-222.ec2.internal   NotReady   <none>   0s    v1.27.12-eks-ae9a62a
ip-10-0-89-222.ec2.internal   NotReady   <none>   0s    v1.27.12-eks-ae9a62a
ip-10-0-89-222.ec2.internal   NotReady   <none>   0s    v1.27.12-eks-ae9a62a
ip-10-0-89-222.ec2.internal   NotReady   <none>   0s    v1.27.12-eks-ae9a62a
ip-10-0-89-222.ec2.internal   NotReady   <none>   0s    v1.27.12-eks-ae9a62a
ip-10-0-89-222.ec2.internal   NotReady   <none>   0s    v1.27.12-eks-ae9a62a
ip-10-0-89-222.ec2.internal   NotReady   <none>   1s    v1.27.12-eks-ae9a62a
ip-10-0-89-222.ec2.internal   NotReady   <none>   10s   v1.27.12-eks-ae9a62a
ip-10-0-89-222.ec2.internal   Ready      <none>   11s   v1.27.12-eks-ae9a62a
ip-10-0-89-222.ec2.internal   Ready      <none>   11s   v1.27.12-eks-ae9a62a
ip-10-0-89-222.ec2.internal   Ready      <none>   16s   v1.27.12-eks-ae9a62a
ip-10-0-89-222.ec2.internal   Ready      <none>   23s   v1.27.12-eks-ae9a62a
ip-10-0-89-222.ec2.internal   Ready      <none>   30s   v1.27.12-eks-ae9a62a
ip-10-0-89-222.ec2.internal   Ready      <none>   31s   v1.27.12-eks-ae9a62a
ip-10-0-89-222.ec2.internal   Ready      <none>   61s   v1.27.12-eks-ae9a62a
ip-10-0-85-147.ec2.internal   Ready      <none>   23m   v1.27.12-eks-ae9a62a
ip-10-0-123-28.ec2.internal   Ready      <none>   19m   v1.27.12-eks-ae9a62a
ip-10-0-85-147.ec2.internal   Ready,SchedulingDisabled   <none>   23m   v1.27.12-eks-ae9a62a
ip-10-0-85-147.ec2.internal   Ready,SchedulingDisabled   <none>   23m   v1.27.12-eks-ae9a62a
ip-10-0-123-28.ec2.internal   Ready,SchedulingDisabled   <none>   19m   v1.27.12-eks-ae9a62a
ip-10-0-123-28.ec2.internal   Ready,SchedulingDisabled   <none>   19m   v1.27.12-eks-ae9a62a
ip-10-0-85-147.ec2.internal   Ready,SchedulingDisabled   <none>   23m   v1.27.12-eks-ae9a62a
ip-10-0-89-222.ec2.internal   Ready                      <none>   92s   v1.27.12-eks-ae9a62a
ip-10-0-89-222.ec2.internal   Ready                      <none>   2m3s   v1.27.12-eks-ae9a62a
ip-10-0-123-28.ec2.internal   NotReady,SchedulingDisabled   <none>   21m    v1.27.12-eks-ae9a62a
ip-10-0-123-28.ec2.internal   NotReady,SchedulingDisabled   <none>   21m    v1.27.12-eks-ae9a62a
ip-10-0-56-188.ec2.internal   NotReady                      <none>   0s     v1.27.12-eks-ae9a62a
ip-10-0-56-188.ec2.internal   NotReady                      <none>   0s     v1.27.12-eks-ae9a62a
ip-10-0-56-188.ec2.internal   NotReady                      <none>   0s     v1.27.12-eks-ae9a62a
ip-10-0-56-188.ec2.internal   NotReady                      <none>   0s     v1.27.12-eks-ae9a62a
ip-10-0-56-188.ec2.internal   NotReady                      <none>   0s     v1.27.12-eks-ae9a62a
ip-10-0-56-188.ec2.internal   NotReady                      <none>   0s     v1.27.12-eks-ae9a62a
ip-10-0-56-188.ec2.internal   NotReady                      <none>   2s     v1.27.12-eks-ae9a62a
ip-10-0-123-28.ec2.internal   NotReady,SchedulingDisabled   <none>   21m    v1.27.12-eks-ae9a62a
ip-10-0-56-188.ec2.internal   NotReady                      <none>   3s     v1.27.12-eks-ae9a62a
ip-10-0-56-188.ec2.internal   NotReady                      <none>   10s    v1.27.12-eks-ae9a62a
ip-10-0-56-188.ec2.internal   Ready                         <none>   12s    v1.27.12-eks-ae9a62a
ip-10-0-56-188.ec2.internal   Ready                         <none>   12s    v1.27.12-eks-ae9a62a
ip-10-0-56-188.ec2.internal   Ready                         <none>   13s    v1.27.12-eks-ae9a62a
ip-10-0-123-28.ec2.internal   NotReady,SchedulingDisabled   <none>   21m    v1.27.12-eks-ae9a62a
ip-10-0-56-188.ec2.internal   Ready                         <none>   28s    v1.27.12-eks-ae9a62a
ip-10-0-56-188.ec2.internal   Ready                         <none>   30s    v1.27.12-eks-ae9a62a
ip-10-0-56-188.ec2.internal   Ready                         <none>   61s    v1.27.12-eks-ae9a62a
ip-10-0-85-147.ec2.internal   NotReady,SchedulingDisabled   <none>   26m    v1.27.12-eks-ae9a62a
ip-10-0-85-147.ec2.internal   NotReady,SchedulingDisabled   <none>   26m    v1.27.12-eks-ae9a62a
ip-10-0-85-147.ec2.internal   NotReady,SchedulingDisabled   <none>   26m    v1.27.12-eks-ae9a62a
ip-10-0-85-147.ec2.internal   NotReady,SchedulingDisabled   <none>   27m    v1.27.12-eks-ae9a62a
ip-10-0-89-222.ec2.internal   Ready                         <none>   7m9s   v1.27.12-eks-ae9a62a
^C%
➜  aws-eks-cluster git:(michael/#ORC-515/add-gpu-support) ✗ k get no
NAME                          STATUS   ROLES    AGE     VERSION
ip-10-0-56-188.ec2.internal   Ready    <none>   4m32s   v1.27.12-eks-ae9a62a
ip-10-0-89-222.ec2.internal   Ready    <none>   7m25s   v1.27.12-eks-ae9a62a

@mclacore (Contributor, Author) commented:

The EBS CSI driver times out when deploying to a standalone GPU node group because of a taint toleration issue with the EBS CSI pods. Not worth investing time in it: as long as a compute or other type of node group is deployed alongside the GPU node group, it works.

@mclacore (Contributor, Author) commented:

Confirmed that both the main branch AMI and the new branch AMI (manually declaring the newest AMI) resolve to the same image ID.
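
For reference, one standard way to do the GPU vs non-GPU AMI lookup is via the SSM parameters AWS publishes for the EKS-optimized Amazon Linux 2 images; whether the lookup here uses aws_ami filters or SSM, both should resolve to the same recommended image IDs. A minimal sketch (var.k8s_version and local.is_gpu are assumed names):

data "aws_ssm_parameter" "eks_ami" {
  # EKS-optimized Amazon Linux 2 AMI for the cluster's Kubernetes version
  name = "/aws/service/eks/optimized-ami/${var.k8s_version}/amazon-linux-2/recommended/image_id"
}

data "aws_ssm_parameter" "eks_gpu_ami" {
  # GPU variant, which ships with the NVIDIA drivers preinstalled
  name = "/aws/service/eks/optimized-ami/${var.k8s_version}/amazon-linux-2-gpu/recommended/image_id"
}

locals {
  # Pick the image per node group; is_gpu stands in for the GPU instance-type check.
  node_image_id = local.is_gpu ? data.aws_ssm_parameter.eks_gpu_ami.value : data.aws_ssm_parameter.eks_ami.value
}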

@mclacore marked this pull request as ready for review May 29, 2024 18:31
@mclacore requested a review from chrisghill May 29, 2024 18:31

@chrisghill (Contributor) left a comment:

Some questions before approving.

massdriver.yaml (review thread resolved)
key = "CriticalAddonsOnly"
operator = "Exists"
}
toleration {

@chrisghill (Contributor) commented:

Why is this toleration needed? We aren't applying it, are we?

@chrisghill (Contributor) commented:

I think they are just using that as the taint in place of the gpu=true taint we added. We probably don't need both. We can use theirs instead.

@mclacore (Contributor, Author) commented:

To confirm: remove the toleration for sku=gpu:NoSchedule and update the dynamic taint on the node group to nvidia.com/gpu:NoSchedule?
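
Something like this, roughly (sketch only: is_gpu stands in for whatever GPU check the node group module ends up using; the upstream NVIDIA device plugin manifest already tolerates nvidia.com/gpu:NoSchedule, so the daemonset side shouldn't need the sku=gpu toleration):

# Fragment of the aws_eks_node_group resource:
dynamic "taint" {
  for_each = local.is_gpu ? [1] : []
  content {
    key    = "nvidia.com/gpu"
    value  = "true"
    effect = "NO_SCHEDULE"
  }
}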

core-services/nvidia_gpu.tf (review thread resolved)
massdriver.yaml (review thread resolved)
node_role_arn = aws_iam_role.node.arn
instance_types = [each.value.instance_type]
for_each = { for ng in var.node_groups : ng.name_suffix => ng }
node_group_name_prefix = "${local.cluster_name}-${each.value.name_suffix}"

@chrisghill (Contributor) commented:

This change is probably what is causing the recreation of all the nodes. Why are we switching from node_group_name to node_group_name_prefix?

@mclacore (Contributor, Author) commented:

Because of node group name collisions. Prior to this change, when updating the launch template, AMI, etc., I would receive an error saying the node group name already existed. By using a prefix instead, the same base name can be reused, and each group still gets a unique name because a suffix is appended.
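
A short sketch of the mechanics (the lifecycle block is an illustration of why unique names matter during replacement, not necessarily something this PR adds):

# Fragment of the aws_eks_node_group resource:

# Before: a fixed name collides when Terraform has to replace the group.
#   node_group_name = "${local.cluster_name}-${each.value.name_suffix}"

# After: the provider appends a generated suffix, so a replacement group gets
# a unique name and can briefly coexist with the group it replaces.
node_group_name_prefix = "${local.cluster_name}-${each.value.name_suffix}"

lifecycle {
  create_before_destroy = true
}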

src/main.tf (review thread outdated, resolved)
<<EOF
#!/bin/bash
set -o xtrace
/etc/eks/bootstrap.sh ${local.cluster_name} --kubelet-extra-args '--node-labels=node.kubernetes.io/instancegroup=${each.key}'

@chrisghill (Contributor) commented:

Is anything else needed in this file? Did you check to see what this file looked like on a default node before this change?

@mclacore (Contributor, Author) commented:

I did look and it's a massive file. I'll paste the contents of it in here after deploying a main branch cluster.
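
In the meantime, for orientation only: the core of the node user data is the bootstrap.sh call, and in a launch template it has to be base64 encoded. A minimal sketch of that wiring, not the actual file contents (the AMI data source is the assumed SSM lookup sketched earlier):

resource "aws_launch_template" "nodes" {
  for_each = { for ng in var.node_groups : ng.name_suffix => ng }

  name_prefix = "${local.cluster_name}-${each.value.name_suffix}-"
  image_id    = data.aws_ssm_parameter.eks_ami.value

  # Launch template user data must be base64 encoded. With a custom image_id,
  # EKS does not inject its own bootstrap, so the template calls bootstrap.sh itself.
  user_data = base64encode(<<-EOF
    #!/bin/bash
    set -o xtrace
    /etc/eks/bootstrap.sh ${local.cluster_name} --kubelet-extra-args '--node-labels=node.kubernetes.io/instancegroup=${each.key}'
  EOF
  )
}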

@chrisghill (Contributor) commented:

This change is substantial enough that we should shelve it for now, until we have marketplace/private registries implemented. We don't want to roll this out to all production EKS clusters right now.

@mclacore closed this Jun 5, 2024