Applying dynamic taint and nvidia daemonset
mclacore committed May 2, 2024
1 parent 2adb5b2 commit 4152193
Showing 6 changed files with 92 additions and 2 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -69,7 +69,7 @@ Form input parameters for configuring a bundle for deployment.
- **`prometheus`** *(object)*: Configuration settings for the Prometheus instances that are automatically installed into the cluster to provide monitoring capabilities.
- **`grafana_enabled`** *(boolean)*: Install Grafana into the cluster to provide a metric visualizer. Default: `False`.
- **`persistence_enabled`** *(boolean)*: This setting will enable persistence of Prometheus data via EBS volumes. However, in small clusters (fewer than 5 nodes) this can cause pod scheduling and placement problems because EBS volumes are zonally locked, so it should be disabled there. Default: `True`.
-- **`node_groups`** *(array)*
+- **`node_groups`** *(array)*: Node groups to provision.
- **Items** *(object)*: Definition of a node group.
- **`advanced_configuration_enabled`** *(boolean)*: Default: `False`.
- **`instance_type`** *(string)*: Instance type to use in the node group.
81 changes: 81 additions & 0 deletions core-services/nvidia_gpu.tf
@@ -0,0 +1,81 @@
resource "kubernetes_daemonset" "nvidia" {
  count = length([for ng in var.node_groups : ng if can(regex("^p[0-9]\\..*", ng.instance_type))]) > 0 ? 1 : 0
  metadata {
    name      = "nvidia-device-plugin-daemonset"
    namespace = kubernetes_namespace_v1.md-core-services.metadata.0.name
    labels = merge(var.md_metadata.default_tags, {
      k8s-app = "nvidia-device-plugin-daemonset"
    })
  }
  spec {
    selector {
      match_labels = {
        name = "nvidia-device-plugin-ds"
      }
    }
    strategy {
      type = "RollingUpdate"
    }
    template {
      metadata {
        labels = merge(var.md_metadata.default_tags, {
          name = "nvidia-device-plugin-ds"
        })
        annotations = {
          "scheduler.alpha.kubernetes.io/critical-pod" : ""
        }
      }
      spec {
        affinity {
          node_affinity {
            required_during_scheduling_ignored_during_execution {
              node_selector_term {
                match_expressions {
                  key      = "accelerator"
                  operator = "In"
                  values   = ["nvidia"]
                }
              }
            }
          }
        }
        toleration {
          key      = "CriticalAddonsOnly"
          operator = "Exists"
        }
        toleration {
          key      = "nvidia.com/gpu"
          operator = "Exists"
          effect   = "NoSchedule"
        }
        toleration {
          key      = "sku"
          operator = "Equal"
          value    = "gpu"
          effect   = "NoSchedule"
        }
        container {
          name  = "nvidia-device-plugin-ctr"
          image = "nvcr.io/nvidia/k8s-device-plugin:v0.9.0"
          args  = ["--fail-on-init-error=false"]
          security_context {
            privileged = false
            capabilities {
              drop = ["all"]
            }
          }
          volume_mount {
            name       = "device-plugin"
            mount_path = "/var/lib/kubelet/device-plugins"
          }
        }
        volume {
          name = "device-plugin"
          host_path {
            path = "/var/lib/kubelet/device-plugins"
          }
        }
      }
    }
  }
}
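
Review note: the daemonset tolerates the `sku=gpu:NoSchedule` taint that this commit adds to GPU node groups, but any workload meant to run on those nodes needs the same toleration. The following is a minimal sketch of such a workload, not part of this commit. The resource name `gpu_smoke_test`, the `default` namespace, and the container image tag are hypothetical choices for illustration, and a configured `kubernetes` provider is assumed.

```hcl
# Hypothetical example: a one-shot pod that lands on a tainted GPU node.
# Names, namespace, and image are illustrative assumptions, not from the commit.
resource "kubernetes_pod_v1" "gpu_smoke_test" {
  metadata {
    name      = "gpu-smoke-test"
    namespace = "default"
  }
  spec {
    restart_policy = "Never"

    # Matches the taint applied to GPU node groups (key "sku", value "gpu").
    toleration {
      key      = "sku"
      operator = "Equal"
      value    = "gpu"
      effect   = "NoSchedule"
    }

    container {
      name    = "cuda"
      image   = "nvidia/cuda:12.2.0-base-ubuntu22.04" # assumed image tag
      command = ["nvidia-smi"]

      resources {
        # The device plugin advertises this extended resource on GPU nodes.
        limits = {
          "nvidia.com/gpu" = "1"
        }
      }
    }
  }
}
```

Requesting `nvidia.com/gpu` restricts scheduling to nodes where the device plugin has registered GPUs, so no explicit node selector is needed in this sketch.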
Binary file added core-services/tf.plan
Binary file not shown.
Binary file added custom-resources/tf.plan
Binary file not shown.
2 changes: 1 addition & 1 deletion massdriver.yaml
@@ -118,7 +118,7 @@ params:
node_groups:
type: array
title: Node Groups
- descrition: Node groups to provision
+ description: Node groups to provision
minItems: 1
items:
type: object
9 changes: 9 additions & 0 deletions src/main.tf
@@ -58,6 +58,15 @@ resource "aws_eks_node_group" "node_group" {
    min_size = each.value.min_size
  }

  dynamic "taint" {
    for_each = can(regex("^p[0-9]\\..*", each.value.instance_type)) ? [1] : []
    content {
      key    = "sku"
      value  = "gpu"
      effect = "NO_SCHEDULE"
    }
  }

  dynamic "taint" {
    for_each = lookup(each.value, "advanced_configuration_enabled", false) ? [each.value.advanced_configuration.taint] : []
    content {
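
Review note: the GPU check relies on the pattern `^p[0-9]\..*`, which matches instance types made of `p`, a single digit, and a dot. The sketch below shows one way to see which instance types that pattern selects. The local and output names and the sample list are hypothetical. It can be evaluated on its own with `terraform apply` or `terraform console` in an empty directory.

```hcl
# Hypothetical standalone check of the GPU instance-type pattern.
locals {
  # Sample instance types chosen for illustration only.
  example_instance_types = ["p3.2xlarge", "p4d.24xlarge", "g5.xlarge", "t3.medium"]

  # Same filter shape as the commit: keep entries where regex() succeeds.
  gpu_instance_types = [
    for t in local.example_instance_types : t
    if can(regex("^p[0-9]\\..*", t))
  ]
}

output "gpu_instance_types" {
  value = local.gpu_instance_types
}

# gpu_instance_types = ["p3.2xlarge"]
```

Under this pattern `p4d.24xlarge` and `g5.xlarge` are not treated as GPU node groups, even though both are NVIDIA GPU instance types, because neither has a dot directly after `p` and one digit. Whether the pattern should be widened is a decision for the bundle authors.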
