
Add taint to user and worker nodes #2605

Open · wants to merge 31 commits into main

Conversation

@Adam-D-Lewis (Member) commented Aug 1, 2024

Reference Issues or PRs

Fixes #2507

  • I need to test running pods with Argo Workflows through Nebari Workflow Controller before merging this PR

What does this implement/fix?

Put an x in the boxes that apply

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds a feature)
  • Breaking change (fix or feature that would cause existing features not to work as expected)
  • Documentation Update
  • Code style update (formatting, renaming)
  • Refactoring (no functional changes, no API changes)
  • Build related changes
  • Other (please describe):

Testing

  • Did you test the pull request locally?
  • Did you add new tests?

Any other comments?

@@ -41,10 +41,33 @@ class ExistingInputVars(schema.Base):
kube_context: str


class DigitalOceanNodeGroup(schema.Base):
@Adam-D-Lewis (Member Author) commented Aug 19, 2024

Duplicate class, so I deleted it

@Adam-D-Lewis (Member Author) commented Aug 19, 2024

This method works as intended when tested on GCP. However, one issue is that certain daemonsets won't run on the tainted nodes. I saw the issue with the rook-ceph csi-cephfsplugin daemonset from my Rook PR, and I expect it would also affect the monitoring daemonset pods. So we'd likely need to add the appropriate toleration to those daemonsets.

@@ -45,6 +45,13 @@ resource "helm_release" "rook-ceph" {
},
csi = {
enableRbdDriver = false, # necessary to provision block storage, but saves some cpu and memory if not needed
provisionerReplicas : 1, # default is 2 on different nodes
pluginTolerations = [
@Adam-D-Lewis (Member Author)
Runs the CSI driver on all nodes, even those with NoSchedule taints; it doesn't run on nodes with NoExecute taints. This matches what the nebari-prometheus-node-exporter daemonset does, so I copied it here.

effect = "NoSchedule"
},
{
operator = "Exists"
@Adam-D-Lewis (Member Author)

Runs promtail on all nodes, even those with NoSchedule taints; it doesn't run on nodes with NoExecute taints. This matches what the nebari-prometheus-node-exporter daemonset does, so I copied it here. Promtail is what exports logs from each node, so we still want it running on the user and worker nodes.

Comment on lines +100 to +109
{
key = "node-role.kubernetes.io/master"
operator = "Exists"
effect = "NoSchedule"
},
{
key = "node-role.kubernetes.io/control-plane"
operator = "Exists"
effect = "NoSchedule"
},
@Adam-D-Lewis (Member Author)

These top two tolerations are the default values for this Helm chart.

@Adam-D-Lewis (Member Author) commented Aug 21, 2024

Okay, so things are working for the user node group. I tried adding a taint to the worker node group, but the dask scheduler won't run on the tainted worker node group. See this commit for what I tried in a quick test. I do see the new scheduler_pod_extra_config value in /var/lib/dask-gateway/config.json in the dask-gateway pod, but the scheduler tolerations look like:

tolerations:
- effect: NoExecute
  key: node.kubernetes.io/not-ready
  operator: Exists
  tolerationSeconds: 300
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  operator: Exists
  tolerationSeconds: 300

so I think the merge isn't happening as expected, but I need to verify. The docs say: "This dict will be deep merged with the scheduler pod spec (a V1PodSpec object) before submission. Keys should match those in the kubernetes spec, and should be camelCase."
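For reference, here is a minimal sketch of the configuration I expected to be deep merged, written as it would appear in a dask-gateway config file (c is the traitlets config object; the taint key is illustrative, not necessarily the one this PR uses):

c.KubeClusterConfig.scheduler_extra_pod_config = {
    "tolerations": [
        {
            "key": "node-role.nebari.dev/worker",  # hypothetical taint key
            "operator": "Exists",
            "effect": "NoSchedule",
        }
    ]
}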

@Adam-D-Lewis (Member Author) commented Oct 25, 2024

I managed to get the taints applied to the scheduler pod in this commit. I would have expected c.KubeClusterConfig.scheduler_extra_pod_config to be merged with the options returned by the function passed to c.Backend.cluster_options, but it wasn't.

  • I should verify this and maybe submit an issue to dask-gateway.

I still need to apply the toleration to the dask workers.
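As a rough sketch of the workaround (not the exact code from the commit), the tolerations can be returned from the handler passed to c.Backend.cluster_options so they apply to both the scheduler and worker pod configs; the taint key and function name here are illustrative:

worker_toleration = {
    "key": "node-role.nebari.dev/worker",  # hypothetical taint key
    "operator": "Exists",
    "effect": "NoSchedule",
}

def options_handler(options, user):
    # The real handler also sets images, resources, and other profile
    # options; only the toleration handling is sketched here.
    config = {}
    for field in ("scheduler_extra_pod_config", "worker_extra_pod_config"):
        config.setdefault(field, {}).setdefault("tolerations", []).append(
            worker_toleration
        )
    return config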

@@ -227,18 +229,23 @@ def base_username_mount(username, uid=1000, gid=100):
}


def worker_profile(options, user):
@Adam-D-Lewis (Member Author)

I renamed this function since it affects the scheduler as well, not just the worker.

@Adam-D-Lewis (Member Author) commented Oct 31, 2024

Okay, things are working as expected for the jupyterlab pod and the dask worker and scheduler pods on GKE. I still need to test on:

  • AWS
  • Azure

I also need to test:

  • running an Argo Workflows pod. (Update: This worked. The taints were copied over when run with jupyterflow-override.)

@Adam-D-Lewis (Member Author) commented Oct 31, 2024

  • We should probably disallow taints on the general node group, since nothing will work correctly with them, it would be a lot of work to fix, and no one has asked for it yet.

Update: I don't think I can do this yet because we don't have a way to enforce that the general node group is named "general". I believe it could currently be called anything and Nebari would still deploy and work correctly. I opened an issue about enforcing the "general", "user", and "worker" node group names. We could abstract the node group names away from what Nebari refers to as the general, user, and worker node groups, but it would save some work if that's not needed. I opened a discussion around node group names: https://github.com/orgs/nebari-dev/discussions/2816.

@Adam-D-Lewis (Member Author) commented Nov 1, 2024

  • Make sure this PR won't break when using the existing or local providers.

@Adam-D-Lewis (Member Author) commented Nov 1, 2024

  • During upgrade, we'll need to add taints to each of the node groups listed, or else users will get an error during deployment (one possible handling is sketched below).

Update: this is resolved now.
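For context, one way an upgrade step could handle this is sketched here; the function and key names are illustrative, not necessarily what this PR does. The idea is to write an explicit (possibly empty) taints entry into each node group of an existing nebari-config so the stricter schema still validates:

def add_explicit_taints(config: dict, provider_key: str) -> dict:
    # Ensure every node group carries an explicit taints list so an
    # existing config keeps validating after the upgrade.
    node_groups = config.get(provider_key, {}).get("node_groups", {})
    for group in node_groups.values():
        group.setdefault("taints", [])
    return config

# e.g. for a GCP deployment:
# config = add_explicit_taints(config, "google_cloud_platform")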

@Adam-D-Lewis (Member Author)
Okay, I think things are working on AWS. I resolved some issues on Azure, but I still need to test there.

@Adam-D-Lewis changed the title from "Add taint to user nodes" to "Add taint to user and worker nodes" on Nov 4, 2024
@Adam-D-Lewis (Member Author) commented Nov 4, 2024

I tested on GCP, Azure, and AWS, and it works well on all three. I did the following in my test:

  • Created an admin user
  • Launched a dask cluster
  • Verified taints were set on the user and worker nodes
  • Verified tolerations were set on the jupyter user pod and the dask scheduler/worker pods

I also tested removing the taints on Azure and AWS and saw that the taints were removed successfully.

@Adam-D-Lewis (Member Author) commented Nov 4, 2024

  • At a minimum, I want to create an issue to prompt users on upgrade, asking whether they want to add the taints for potential cost savings.

Update: done now - #2824

@@ -150,6 +201,22 @@ class AWSNodeGroupInputVars(schema.Base):
permissions_boundary: Optional[str] = None
ami_type: Optional[AWSAmiTypes] = None
launch_template: Optional[AWSNodeLaunchTemplate] = None
node_taints: list[dict]

@field_validator("node_taints", mode="before")
@Adam-D-Lewis (Member Author) commented Nov 4, 2024

This code is repeated (see line 233 in this file) for the GCP and AWS NodeGroupInputVars classes, but that's because the format expected by the GCP and AWS Terraform modules for taints happens to be the same. I think the required formats for the different modules could evolve separately, so I chose to duplicate the code in this case.
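For illustration, the duplicated validator has roughly this shape, assuming a pydantic Taint model with key/value/effect fields (the model and method names here are hypothetical stand-ins for the actual schema classes):

from typing import Optional

from pydantic import BaseModel, field_validator


class Taint(BaseModel):  # hypothetical stand-in for the schema's taint model
    key: str
    value: Optional[str] = None
    effect: str


class AWSNodeGroupInputVars(BaseModel):
    node_taints: list[dict]

    @field_validator("node_taints", mode="before")
    @classmethod
    def taints_to_dict(cls, value):
        # Convert Taint models into the plain dicts the GCP/AWS Terraform
        # modules expect; values that are already dicts pass through.
        return [t if isinstance(t, dict) else t.model_dump() for t in value or []]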

@Adam-D-Lewis (Member Author)
We should add some instructions to the docs about adding GPU nodes: users should add the user taint to other GPU user node profiles in order to avoid the same issue this PR addresses.

@Adam-D-Lewis added this to the 2024.11.2 release milestone on Nov 7, 2024
Successfully merging this pull request may close these issues.

[BUG] - Nodes don't scale down on GKE and AKS