Taint nodes before deletion #621

Open
hamishforbes opened this issue Oct 5, 2022 · 14 comments
Labels: kind/feature, lifecycle/rotten, v1

@hamishforbes

Tell us about your request

When removing nodes due to consolidation, I would like to be able to apply a taint to the node before it is removed.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

The reason for this is to be able to gracefully stop DaemonSet pods; see the related issues below.

I have Consul agents running on nodes via a DaemonSet; these agents join the Consul cluster.
If they are just killed, they sit around in the cluster as failed members; if the pod is given a stop signal, it will gracefully leave the cluster and then exit.

When a node is just deleted, it leaves a bunch of hanging agents in my Consul cluster.
Applying a NoExecute taint prior to deletion will evict those pods.

System DaemonSets (e.g. kube-proxy) tolerate all taints, so this won't evict those pods.

Are you currently working around this issue?

Without Karpenter, nodes are generally only removed:
a) manually, in which case I taint the node with a NoExecute taint myself (see the example below), or
b) by the node-termination-handler, which is configured to add a taint as well.

With Karpenter... well, the workaround is to manually clear failed nodes out of my Consul cluster, or to get this feature added!
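
For reference, the manual taint in a) is just something along these lines (consul-drain is only an illustrative key name, nothing Consul-specific; any key the DaemonSet doesn't tolerate would do):

# hypothetical custom taint key: the Consul pods don't tolerate it, so they are evicted gracefully
kubectl taint nodes <node-name> consul-drain=true:NoExecute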

Additional Context

aws/aws-node-termination-handler#273
kubernetes/kubernetes#75482

Attachments

No response

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@hamishforbes added the kind/feature label Oct 5, 2022
@ellistarn
Contributor

> When a node is just deleted, it leaves a bunch of hanging agents in my Consul cluster.

This is a bit confusing to me -- the term cluster appears to mean consul cluster, but isn't always disambiguated from kubernetes cluster. If I understand correctly, you don't have hanging pods in your Kubernetes cluster, but you do have hanging agents in your Consul cluster. Is it correct that the way for Consul agents to clean up is with termination logic in the pod that hosts it?

Consolidation relies on a unified "termination controller" in karpenter, so its cordon+drain logic is identical to other forms of termination (e.g. expiry).

Every node we terminate undergoes the following process:

  1. Taint the node
  2. Identify evictable pods
  3. Pods in "Succeeded" or "Failed" are ignored.
  4. Pods that are "stuck terminating" (i.e. beyond deletion timestamp) are ignored
  5. Pods that tolerate Unschedulable=NoSchedule are ignored, since it would trigger an eviction loop

It's not clear to me which of these steps you're falling under, but I don't believe tainting the node would solve your problem. We should be issuing an evict for any running Consul pod and allowing it to clean up in its GracefulTerminationPeriod. Have you been able to observe this in action for more details?
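
Concretely, step 1 leaves something like this on the node, and step 5's toleration check is matched against it:

spec:
  taints:
  # the cordon taint; pods that tolerate it fall under step 5 and are not evicted
  - key: node.kubernetes.io/unschedulable
    effect: NoSchedule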

@hamishforbes
Author

> This is a bit confusing to me -- the term cluster appears to mean consul cluster, but isn't always disambiguated from kubernetes cluster. If I understand correctly, you don't have hanging pods in your Kubernetes cluster, but you do have hanging agents in your Consul cluster. Is it correct that the way for Consul agents to clean up is with termination logic in the pod that hosts it?

Yes, correct. Sorry, I should've been clearer about when I was referring to the Kubernetes cluster and when to the Consul cluster. Overloaded terms!

> Consolidation relies on a unified "termination controller" in karpenter, so its cordon+drain logic is identical to other forms of termination (e.g. expiry).
>
> Every node we terminate undergoes the following process:
>
> 1. Taint the node
> 2. Identify evictable pods
> 3. Pods in "Succeeded" or "Failed" are ignored.
> 4. Pods that are "stuck terminating" (i.e. beyond deletion timestamp) are ignored
> 5. Pods that tolerate Unschedulable=NoSchedule are ignored, since it would trigger an eviction loop
>
> It's not clear to me which of these steps you're falling under, but I don't believe tainting the node would solve your problem. We should be issuing an evict for any running Consul pod and allowing it to clean up in its GracefulTerminationPeriod. Have you been able to observe this in action for more details?

The Consul Agent pods are part of a DaemonSet and therefore won't be explicitly evicted.

In the specific cases I've seen, the node is considered empty because the DaemonSet pods are filtered out, so it jumps straight to being deleted.

But even if the node were being replaced, Karpenter won't explicitly evict DaemonSet pods, which is a very similar problem to the kubectl drain issue I linked to.
The 'classic' cluster autoscaler also suffers from this issue.

Just deleting the node might be fine?
Does the kubelet send a termination signal to all the pods when a node is deleted? I can't seem to find a definitive answer.
Or is there a race, with Karpenter terminating the EC2 instance before the pods can be gracefully shut down?

Applying a NoExecute taint with a custom key to the node before deleting it ensures the Consul pods are gracefully terminated before the node is removed from the Kubernetes cluster and before the EC2 instance is terminated.

On second glance, I have Kubernetes nodes created by Karpenter that correctly and gracefully left the Consul cluster, but those were nodes that did actual work.
Other Karpenter-provisioned Kubernetes nodes are marked failed in the Consul cluster, but those had a lifetime of less than 2 minutes (a separate problem, which I think is solved by increasing the batch duration).

@bwagner5
Contributor

I was about to recommend using K8s graceful node shutdown to give your consul pods time to deregister. However, it appears it doesn't work with the version of systemd (219) shipped with Amazon Linux 2 (kubernetes/kubernetes#107043 (comment)).

Graceful node shutdown seems like the way to go for this type of issue, though, if we can get systemd updated or when AL2022 is supported. I believe it does work with Ubuntu.

@ellistarn
Contributor

> The Consul Agent pods are part of a DaemonSet and therefore won't be explicitly evicted.

Do the consul agents tolerate NotReady/Unreachable taints? If you remove that toleration, then things should just work. Karpenter only looks at pods that reschedule, not at whether or not it's owned by a DaemonSet.

@hamishforbes
Author

> I was about to recommend using K8s graceful node shutdown to give your consul pods time to deregister. However, it appears it doesn't work with the version of systemd (219) shipped with Amazon Linux 2 (kubernetes/kubernetes#107043 (comment)).
>
> Graceful node shutdown seems like the way to go for this type of issue, though, if we can get systemd updated or when AL2022 is supported. I believe it does work with Ubuntu.

Oh, interesting! That does look like the more correct solution. Tainting is definitely a hack/workaround.

In addition to having a systemd version that actually works with this, the shutdownGracePeriod would need to be changed from its default of 0.
It looks like it is supported via the kubelet config file, but it's not included as one of the options you can set via the Provisioner CRD?
So the only option currently would be messing around with the UserData in the AWSNodeTemplate?
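
(For reference, the kubelet-side settings I mean are roughly these; shutdownGracePeriodCriticalPods is the optional companion field that reserves part of the window for critical pods:)

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# both default to 0, which disables graceful node shutdown
shutdownGracePeriod: 2m
shutdownGracePeriodCriticalPods: 30s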

> Do the consul agents tolerate NotReady/Unreachable taints? If you remove that toleration, then things should just work. Karpenter only looks at pods that reschedule, not at whether or not it's owned by a DaemonSet.

Yeah, DaemonSet pods automatically tolerate taints for unschedulable / not-ready, etc.

The tolerations aren't in the spec for the DaemonSet, but they are on the pods, which is why the NTH can add a custom taint that isn't tolerated.

> kubectl get ds consul-consul-client -oyaml | ggrep -A 5 tolerations
      tolerations:
      - key: eck_cluster
        operator: Exists
      - key: prometheus
        operator: Exists
      volumes:
> kubectl get pod consul-consul-client-vp2ph -oyaml | grep -A 26 tolerations
  tolerations:
  - key: eck_cluster
    operator: Exists
  - key: prometheus
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/pid-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/network-unavailable
    operator: Exists
  volumes:

@bwagner5
Contributor

bwagner5 commented Oct 24, 2022

> Oh, interesting! That does look like the more correct solution. Tainting is definitely a hack/workaround.
>
> In addition to having a systemd version that actually works with this, the shutdownGracePeriod would need to be changed from its default of 0.
> It looks like it is supported via the kubelet config file, but it's not included as one of the options you can set via the Provisioner CRD?
> So the only option currently would be messing around with the UserData in the AWSNodeTemplate?

It's pretty trivial to plumb the config through the AWSNodeTemplate, but we'll probably wait on that until the EKS Optimized AMI supports the configuration, which will probably be when we migrate to AL2022. I don't believe Bottlerocket supports it either.

It's pretty easy to patch it within your user-data in the AWSNodeTemplate if you want to try it with Ubuntu or build your own AMI with an updated systemd.

I think this would work (although I haven't tested it yet):

apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: graceful-shutdown
spec:
  amiFamily: Ubuntu
  subnetSelector:
    karpenter.sh/discovery: my-cluster
  securityGroupSelector:
    karpenter.sh/discovery: my-cluster
  userData: |
    MIME-Version: 1.0
    Content-Type: multipart/mixed; boundary="BOUNDARY"

    --BOUNDARY
    Content-Type: text/x-shellscript; charset="us-ascii"

    #!/bin/bash
    echo "$(jq '.shutdownGracePeriod="2m"' /etc/kubernetes/kubelet/kubelet-config.json)" > /etc/kubernetes/kubelet/kubelet-config.json

    --BOUNDARY--

@ellistarn added the v1 label Apr 18, 2023
@sftim

sftim commented Jun 5, 2023

I think we'd also like to taint nodes the moment that Karpenter thinks those nodes have become eligible for consolidation.

This lets us quickly untaint them if we see unschedulable Pods that we think might fit there. Otherwise, we'd leave the taint in place through the node drain, shutdown and termination.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jan 30, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Feb 29, 2024
@jmdeal
Member

jmdeal commented Mar 12, 2024

/remove-lifecycle rotten

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jun 10, 2024
@jmdeal
Member

jmdeal commented Jun 13, 2024

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Jun 13, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Sep 11, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Oct 11, 2024