Applying dynamic taint and nvidia daemonset #64
Conversation
Confirmed the nodes are getting bounced during upgrade.

The EBS CSI driver times out when deploying to a standalone GPU node group because of taint toleration issues with the EBS CSI pods. It isn't worth investing time into: as long as a compute or other type of node group is deployed alongside the GPU node group, it works.

Confirmed that both the main branch AMI and the new branch AMI (manually declaring the newest AMI) resolve to the same image ID.
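For anyone repeating that check, one way to see which image ID the EKS-optimized AMI lookup resolves to is the public SSM parameter (a sketch; the k8s version in the path is illustrative, substitute the cluster's version):

```bash
# Resolve the recommended EKS-optimized Amazon Linux 2 AMI for a cluster version.
aws ssm get-parameter \
  --name /aws/service/eks/optimized-ami/1.27/amazon-linux-2/recommended/image_id \
  --query 'Parameter.Value' --output text
```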
Some questions before approving.
key = "CriticalAddonsOnly" | ||
operator = "Exists" | ||
} | ||
toleration { |
Why is this toleration needed? We aren't applying it, are we?
It comes with the NVIDIA daemonset plugin: https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml
And it then appears to be used by GPU pods: https://github.com/NVIDIA/k8s-device-plugin?tab=readme-ov-file#running-gpu-jobs
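For reference, the tolerations in that manifest map onto the blocks in this diff; expressed with the Terraform `kubernetes` provider they'd look roughly like this (a sketch, not the PR's exact code):

```hcl
toleration {
  key      = "CriticalAddonsOnly"
  operator = "Exists"
}
toleration {
  # Lets the device plugin land on nodes tainted nvidia.com/gpu:NoSchedule.
  key      = "nvidia.com/gpu"
  operator = "Exists"
  effect   = "NoSchedule"
}
```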
I think they are just using that as the taint in place of the `gpu=true` taint we added. We probably don't need both. We can use theirs instead.
To confirm: remove the toleration for `sku=gpu:NoSchedule` and update the dynamic taint in the node group to `nvidia.com/gpu:NoSchedule`?
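If so, on the node group side that would be a taint block along these lines (a sketch against `aws_eks_node_group`; the value is illustrative, since the plugin's toleration uses `operator: Exists` any value matches):

```hcl
taint {
  key    = "nvidia.com/gpu"
  value  = "true"        # illustrative; the plugin tolerates any value
  effect = "NO_SCHEDULE" # aws_eks_node_group uses uppercase effect values
}
```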
```hcl
node_role_arn          = aws_iam_role.node.arn
instance_types         = [each.value.instance_type]
for_each               = { for ng in var.node_groups : ng.name_suffix => ng }
node_group_name_prefix = "${local.cluster_name}-${each.value.name_suffix}"
```
This change is probably what is causing the recreation of all the nodes. Why are we switching from `node_group_name` to `node_group_name_prefix`?
Because of node group name collisions. Prior to this change, when updating the launch template and AMI, etc., I would receive an error saying the node group name already existed. By using `node_group_name_prefix` instead, the same base name can be reused, and each node group stays unique thanks to the generated suffix.
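A minimal sketch of the pattern, assuming the attributes from the diff above (`subnet_ids` and the `scaling_config` sizes are illustrative additions to make the resource complete):

```hcl
resource "aws_eks_node_group" "this" {
  for_each = { for ng in var.node_groups : ng.name_suffix => ng }

  cluster_name = local.cluster_name
  # Terraform appends a unique suffix to the prefix, so a replacement
  # group never collides with the name of the group it replaces.
  node_group_name_prefix = "${local.cluster_name}-${each.value.name_suffix}"
  node_role_arn          = aws_iam_role.node.arn
  instance_types         = [each.value.instance_type]
  subnet_ids             = var.subnet_ids

  scaling_config {
    desired_size = 1
    min_size     = 1
    max_size     = 2
  }

  # Bring the new group up before tearing the old one down on replacement.
  lifecycle {
    create_before_destroy = true
  }
}
```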
```hcl
<<EOF
#!/bin/bash
set -o xtrace
/etc/eks/bootstrap.sh ${local.cluster_name} --kubelet-extra-args '--node-labels=node.kubernetes.io/instancegroup=${each.key}'
EOF
```
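Once a node joins with that flag, the label can be verified directly (`-L` shows the label as an extra column):

```bash
# Show the instancegroup label for every node as an extra column.
kubectl get nodes -L node.kubernetes.io/instancegroup
```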
Is anything else needed in this file? Did you check to see what this file looked like on a default node before this change?
I did look and it's a massive file. I'll paste the contents of it in here after deploying a main branch cluster.
This change is substantial enough that we should shelve it for now, until we have marketplace/private registries implemented. We don't want to roll this out to all production EKS clusters right now.
Changes in this PR:

- Changed `node_group_name` to `node_group_name_prefix` to avoid node group name collisions (confirmed this is a backwards compatible change)
- Removed the `version` field from `aws_eks_node_group`, because the lookup and setting of the AMI in the launch template (as `image_id`) now sets the k8s version for the node group
- Added the `image_id` setting for the launch template
- Added the `bootstrap.sh` script call that AMIs depend on to join the EKS cluster

Tests run against changes:

TODO:

- `CUSTOM` AMI type
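For the `CUSTOM` AMI type TODO, the relevant shape would be something like this sketch (the launch template name `aws_launch_template.node` is an assumption):

```hcl
resource "aws_eks_node_group" "gpu" {
  # When the launch template pins image_id and supplies its own bootstrap
  # user data, EKS expects the node group's AMI type to be CUSTOM.
  ami_type = "CUSTOM"

  launch_template {
    id      = aws_launch_template.node.id
    version = aws_launch_template.node.latest_version
  }

  # ... cluster_name, node_role_arn, subnet_ids, scaling_config as above
}
```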