
Add taint to user and worker nodes #2605

Open · wants to merge 31 commits into main

Conversation

@Adam-D-Lewis (Member) commented Aug 1, 2024

Reference Issues or PRs

Fixes #2507

  • I need to test running pods with Argo Workflows through Nebari Workflow Controller before merging this PR

What does this implement/fix?

Put an x in the boxes that apply

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds a feature)
  • Breaking change (fix or feature that would cause existing features not to work as expected)
  • Documentation Update
  • Code style update (formatting, renaming)
  • Refactoring (no functional changes, no API changes)
  • Build related changes
  • Other (please describe):

Testing

  • Did you test the pull request locally?
  • Did you add new tests?

Any other comments?

@@ -41,10 +41,33 @@ class ExistingInputVars(schema.Base):
kube_context: str


class DigitalOceanNodeGroup(schema.Base):
@Adam-D-Lewis (Member Author) commented Aug 19, 2024

Duplicate class, so I deleted it

@Adam-D-Lewis (Member Author) commented Aug 19, 2024

This method works as intended when tested on GCP. However, one issue is that certain daemonsets won't run on the tainted nodes. I saw the issue with the rook-ceph csi-cephfsplugin daemonset from my Rook PR, and I expect it would also affect the monitoring daemonset pods. So we'd likely need to add the appropriate toleration to those daemonsets.

@@ -45,6 +45,13 @@ resource "helm_release" "rook-ceph" {
},
csi = {
enableRbdDriver = false, # necessary to provision block storage, but saves some cpu and memory if not needed
provisionerReplicas : 1, # default is 2 on different nodes
pluginTolerations = [
@Adam-D-Lewis (Member Author)
Runs the CSI driver on all nodes, even those with NoSchedule taints; it doesn't run on nodes with NoExecute taints. This matches what the nebari-prometheus-node-exporter daemonset does, so I copied it here.

effect = "NoSchedule"
},
{
operator = "Exists"
@Adam-D-Lewis (Member Author)

Runs promtail on all nodes, even those with NoSchedule taints; it doesn't run on nodes with NoExecute taints. This matches what the nebari-prometheus-node-exporter daemonset does, so I copied it here. Promtail is what exports logs from each node, so we still want it running on the user and worker nodes.

Comment on lines +100 to +109
{
key = "node-role.kubernetes.io/master"
operator = "Exists"
effect = "NoSchedule"
},
{
key = "node-role.kubernetes.io/control-plane"
operator = "Exists"
effect = "NoSchedule"
},
@Adam-D-Lewis (Member Author)

These top two tolerations are the default values for this Helm chart.

@Adam-D-Lewis (Member Author) commented Aug 21, 2024

Okay, so things are working for the user node group. I tried adding a taint to the worker node group, but the dask scheduler won't run on the tainted worker node group. See this commit for what I tried in a quick test. I do see the new scheduler_pod_extra_config value in /var/lib/dask-gateway/config.json in the dask-gateway pod, but the scheduler tolerations look like:

tolerations:
- effect: NoExecute
  key: node.kubernetes.io/not-ready
  operator: Exists
  tolerationSeconds: 300
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  operator: Exists
  tolerationSeconds: 300

so I think the merge isn't happening as expected, but I need to verify. The docs say: "This dict will be deep merged with the scheduler pod spec (a V1PodSpec object) before submission. Keys should match those in the kubernetes spec, and should be camelCase."
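For reference, here is a minimal sketch of the configuration I expected to be deep merged, written as it would appear in a dask-gateway config file (c is the traitlets config object; the taint key is illustrative, not necessarily the one this PR uses):

c.KubeClusterConfig.scheduler_extra_pod_config = {
    "tolerations": [
        {
            "key": "node-role.nebari.dev/worker",  # hypothetical taint key
            "operator": "Exists",
            "effect": "NoSchedule",
        }
    ]
}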

@Adam-D-Lewis (Member Author) commented Oct 25, 2024

I managed to get the taints applied to the scheduler pod in this commit. I would have expected c.KubeClusterConfig.scheduler_extra_pod_config to be merged with the options returned by the function passed to c.Backend.cluster_options, but it wasn't.

  • I should verify this and maybe submit an issue to dask-gateway.

I still need to apply the toleration to the dask workers.
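As a rough sketch of the workaround (not the exact code from the commit), the tolerations can be returned from the handler passed to c.Backend.cluster_options so they apply to both the scheduler and worker pod configs; the taint key and function name here are illustrative:

worker_toleration = {
    "key": "node-role.nebari.dev/worker",  # hypothetical taint key
    "operator": "Exists",
    "effect": "NoSchedule",
}

def options_handler(options, user):
    # The real handler also sets images, resources, and other profile
    # options; only the toleration handling is sketched here.
    config = {}
    for field in ("scheduler_extra_pod_config", "worker_extra_pod_config"):
        config.setdefault(field, {}).setdefault("tolerations", []).append(
            worker_toleration
        )
    return config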

@@ -227,18 +229,23 @@ def base_username_mount(username, uid=1000, gid=100):
}


def worker_profile(options, user):
@Adam-D-Lewis (Member Author)

I renamed this function since it affects the scheduler as well, not just the worker.

@Adam-D-Lewis (Member Author) commented Oct 31, 2024

Okay, things are working as expected for the jupyterlab pod and the dask worker and scheduler pods on GKE. I still need to test on:

  • AWS
  • Azure

I also need to test:

  • running an Argo Workflows pod. (Update: This worked. The taints were copied over when run with jupyterflow-override.)

@Adam-D-Lewis (Member Author) commented Oct 31, 2024

  • We should probably disallow taints on the general node group, since nothing will work correctly with them, it would be a lot of work to fix, and no one has asked for it yet.

Update: I don't think I can do this yet because we don't have a way to enforce that the general node group is named "general". I believe it could currently be called anything and Nebari would still deploy and work correctly. I opened an issue about enforcing the "general", "user", and "worker" node group names. We could abstract the node group names away from what Nebari refers to as the general, user, and worker node groups, but it would save some work if that's not needed. I opened a discussion around node group names: https://github.com/orgs/nebari-dev/discussions/2816.

@Adam-D-Lewis (Member Author) commented Nov 1, 2024

  • Make sure this PR won't break when using the existing or local providers.

@Adam-D-Lewis (Member Author) commented Nov 1, 2024

  • During upgrade, we'll need to add taints to each of the node groups listed, or else users will get an error during deployment (one possible handling is sketched below).

Update: this is resolved now.
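For context, one way an upgrade step could handle this is sketched here; the function and key names are illustrative, not necessarily what this PR does. The idea is to write an explicit (possibly empty) taints entry into each node group of an existing nebari-config so the stricter schema still validates:

def add_explicit_taints(config: dict, provider_key: str) -> dict:
    # Ensure every node group carries an explicit taints list so an
    # existing config keeps validating after the upgrade.
    node_groups = config.get(provider_key, {}).get("node_groups", {})
    for group in node_groups.values():
        group.setdefault("taints", [])
    return config

# e.g. for a GCP deployment:
# config = add_explicit_taints(config, "google_cloud_platform")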

@Adam-D-Lewis (Member Author)
Okay, I think things are working on AWS. I resolved some issues on Azure, but I still need to test there.

@Adam-D-Lewis changed the title from "Add taint to user nodes" to "Add taint to user and worker nodes" on Nov 4, 2024
@Adam-D-Lewis (Member Author) commented Nov 4, 2024

I tested on GCP, Azure, and AWS, and it works well on all three. I did the following in my test:

  • Created an admin user
  • Launched a dask cluster
  • Verified taints were set on the user and worker nodes
  • Verified tolerations were set on the jupyter user pod and the dask scheduler/worker pods

I also tested removing the taints on Azure and AWS and saw that the taints were removed successfully.

@Adam-D-Lewis (Member Author) commented Nov 4, 2024

  • At a minimum, I want to create an issue to prompt users on upgrade, asking whether they want to add the taints for potential cost savings.

Update: done now - #2824

@@ -150,6 +201,22 @@ class AWSNodeGroupInputVars(schema.Base):
permissions_boundary: Optional[str] = None
ami_type: Optional[AWSAmiTypes] = None
launch_template: Optional[AWSNodeLaunchTemplate] = None
node_taints: list[dict]

@field_validator("node_taints", mode="before")
@Adam-D-Lewis (Member Author) commented Nov 4, 2024

This code is repeated (see line 233 in this file) for the GCP and AWS NodeGroupInputVars classes, but that's because the format expected by the GCP and AWS Terraform modules for taints happens to be the same. I think the required formats for the different modules could evolve separately, so I chose to duplicate the code in this case.
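For illustration, the duplicated validator has roughly this shape, assuming a pydantic Taint model with key/value/effect fields (the model and method names here are hypothetical stand-ins for the actual schema classes):

from typing import Optional

from pydantic import BaseModel, field_validator


class Taint(BaseModel):  # hypothetical stand-in for the schema's taint model
    key: str
    value: Optional[str] = None
    effect: str


class AWSNodeGroupInputVars(BaseModel):
    node_taints: list[dict]

    @field_validator("node_taints", mode="before")
    @classmethod
    def taints_to_dict(cls, value):
        # Convert Taint models into the plain dicts the GCP/AWS Terraform
        # modules expect; values that are already dicts pass through.
        return [t if isinstance(t, dict) else t.model_dump() for t in value or []]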

@Adam-D-Lewis (Member Author)
We should add some instructions to the docs about adding GPU nodes: users should add the user taint to other GPU user node profiles in order to avoid the same issue this PR addresses.

@Adam-D-Lewis added this to the 2024.11.2 release milestone on Nov 7, 2024
Successfully merging this pull request may close these issues.

[BUG] - Nodes don't scale down on GKE and AKS