Taint nodes before deletion #621

Open
hamishforbes opened this issue Oct 5, 2022 · 14 comments
Labels: kind/feature, lifecycle/rotten, v1

@hamishforbes

Tell us about your request

When removing nodes due to consolidation, I would like to be able to apply a taint to the node before it is removed.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

The reason for this is to be able to gracefully stop DaemonSet pods; see the related issues below.

I have Consul agents running on nodes via a DaemonSet; these agents join the Consul cluster.
If they are just killed, they sit around in the cluster as failed members; if the pod is given a stop signal, it will gracefully leave the cluster and then exit.

When a node is just deleted, it leaves a bunch of hanging agents in my Consul cluster.
Applying a NoExecute taint prior to deletion will evict those pods.

System DaemonSets (e.g. kube-proxy) tolerate all taints, so this won't evict those pods.

Are you currently working around this issue?

Without Karpenter, nodes are generally only removed:
a) manually, in which case I taint the node with a NoExecute taint myself (see the example below), or
b) by the node-termination-handler, which is configured to add a taint as well.

With Karpenter... well, the workaround is to manually clear failed nodes out of my Consul cluster, or to get this feature added!
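
For reference, the manual taint in a) is just something along these lines (consul-drain is only an illustrative key name, nothing Consul-specific; any key the DaemonSet doesn't tolerate would do):

# hypothetical custom taint key: the Consul pods don't tolerate it, so they are evicted gracefully
kubectl taint nodes <node-name> consul-drain=true:NoExecute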

Additional Context

aws/aws-node-termination-handler#273
kubernetes/kubernetes#75482

Attachments

No response

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@hamishforbes added the kind/feature label Oct 5, 2022
@ellistarn
Contributor

> When a node is just deleted, it leaves a bunch of hanging agents in my Consul cluster.

This is a bit confusing to me -- the term cluster appears to mean consul cluster, but isn't always disambiguated from kubernetes cluster. If I understand correctly, you don't have hanging pods in your Kubernetes cluster, but you do have hanging agents in your Consul cluster. Is it correct that the way for Consul agents to clean up is with termination logic in the pod that hosts it?

Consolidation relies on a unified "termination controller" in karpenter, so its cordon+drain logic is identical to other forms of termination (e.g. expiry).

Every node we terminate undergoes the following process:

  1. Taint the node
  2. Identify evictable pods
  3. Pods in "Succeeded" or "Failed" are ignored.
  4. Pods that are "stuck terminating" (i.e. beyond deletion timestamp) are ignored
  5. Pods that tolerate Unschedulable=NoSchedule are ignored, since it would trigger an eviction loop

It's not clear to me which of these steps you're falling under, but I don't believe tainting the node would solve your problem. We should be issuing an evict for any running Consul pod and allowing it to clean up in its GracefulTerminationPeriod. Have you been able to observe this in action for more details?
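
Concretely, step 1 leaves something like this on the node, and step 5's toleration check is matched against it:

spec:
  taints:
  # the cordon taint; pods that tolerate it fall under step 5 and are not evicted
  - key: node.kubernetes.io/unschedulable
    effect: NoSchedule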

@hamishforbes
Author

> This is a bit confusing to me -- the term cluster appears to mean consul cluster, but isn't always disambiguated from kubernetes cluster. If I understand correctly, you don't have hanging pods in your Kubernetes cluster, but you do have hanging agents in your Consul cluster. Is it correct that the way for Consul agents to clean up is with termination logic in the pod that hosts it?

Yes, correct. Sorry, I should've been clearer about when I was referring to the Kubernetes cluster and when to the Consul cluster. Overloaded terms!

> Consolidation relies on a unified "termination controller" in karpenter, so its cordon+drain logic is identical to other forms of termination (e.g. expiry).
>
> Every node we terminate undergoes the following process:
>
> 1. Taint the node
> 2. Identify evictable pods
> 3. Pods in "Succeeded" or "Failed" are ignored.
> 4. Pods that are "stuck terminating" (i.e. beyond deletion timestamp) are ignored
> 5. Pods that tolerate Unschedulable=NoSchedule are ignored, since it would trigger an eviction loop
>
> It's not clear to me which of these steps you're falling under, but I don't believe tainting the node would solve your problem. We should be issuing an evict for any running Consul pod and allowing it to clean up in its GracefulTerminationPeriod. Have you been able to observe this in action for more details?

The Consul Agent pods are part of a DaemonSet and therefore won't be explicitly evicted.

In the specific cases I've seen, the node is considered empty because the DaemonSet pods are filtered out, so it jumps straight to being deleted.

But even if the node were being replaced, Karpenter won't explicitly evict DaemonSet pods, which is a very similar problem to the kubectl drain issue I linked to.
The 'classic' cluster autoscaler also suffers from this issue.

Just deleting the node might be fine?
Does the kubelet send a termination signal to all the pods when a node is deleted? I can't seem to find a definitive answer.
Or is there a race, with Karpenter terminating the EC2 instance before the pods can be gracefully shut down?

Applying a NoExecute taint with a custom key to the node before deleting it ensures the Consul pods are gracefully terminated before the node is removed from the Kubernetes cluster and before the EC2 instance is terminated.

On second glance, I have Kubernetes nodes created by Karpenter that correctly and gracefully left the Consul cluster, but those were nodes that did actual work.
Other Karpenter-provisioned Kubernetes nodes are marked failed in the Consul cluster, but those had a lifetime of less than 2 minutes (a separate problem, which I think is solved by increasing the batch duration).

@bwagner5
Contributor

I was about to recommend using K8s graceful node shutdown to give your consul pods time to deregister. However, it appears it doesn't work with the version of systemd (219) shipped with Amazon Linux 2 (kubernetes/kubernetes#107043 (comment)).

Graceful node shutdown seems like the way to go for this type of issue, though, if we can get systemd updated or when AL2022 is supported. I believe it does work with Ubuntu.

@ellistarn
Contributor

> The Consul Agent pods are part of a DaemonSet and therefore won't be explicitly evicted.

Do the consul agents tolerate NotReady/Unreachable taints? If you remove that toleration, then things should just work. Karpenter only looks at pods that reschedule, not at whether or not it's owned by a DaemonSet.

@hamishforbes
Author

> I was about to recommend using K8s graceful node shutdown to give your consul pods time to deregister. However, it appears it doesn't work with the version of systemd (219) shipped with Amazon Linux 2 (kubernetes/kubernetes#107043 (comment)).
>
> Graceful node shutdown seems like the way to go for this type of issue, though, if we can get systemd updated or when AL2022 is supported. I believe it does work with Ubuntu.

Oh, interesting! That does look like the more correct solution. Tainting is definitely a hack/workaround.

In addition to having a systemd version that actually works with this, the shutdownGracePeriod would need to be changed from its default of 0.
It looks like it is supported via the kubelet config file, but it's not included as one of the options you can set via the Provisioner CRD?
So the only option currently would be messing around with the UserData in the AWSNodeTemplate?
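
(For reference, the kubelet-side settings I mean are roughly these; shutdownGracePeriodCriticalPods is the optional companion field that reserves part of the window for critical pods:)

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# both default to 0, which disables graceful node shutdown
shutdownGracePeriod: 2m
shutdownGracePeriodCriticalPods: 30s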

> Do the consul agents tolerate NotReady/Unreachable taints? If you remove that toleration, then things should just work. Karpenter only looks at pods that reschedule, not at whether or not it's owned by a DaemonSet.

Yeah, DaemonSet pods automatically tolerate taints for unschedulable / not-ready, etc.

The tolerations aren't in the spec for the DaemonSet, but they are on the pods, which is why the NTH can add a custom taint that isn't tolerated.

> kubectl get ds consul-consul-client -oyaml | ggrep -A 5 tolerations
      tolerations:
      - key: eck_cluster
        operator: Exists
      - key: prometheus
        operator: Exists
      volumes:
> kubectl get pod consul-consul-client-vp2ph -oyaml | grep -A 26 tolerations
  tolerations:
  - key: eck_cluster
    operator: Exists
  - key: prometheus
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/pid-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/network-unavailable
    operator: Exists
  volumes:

@bwagner5
Contributor

bwagner5 commented Oct 24, 2022

> Oh, interesting! That does look like the more correct solution. Tainting is definitely a hack/workaround.
>
> In addition to having a systemd version that actually works with this, the shutdownGracePeriod would need to be changed from its default of 0.
> It looks like it is supported via the kubelet config file, but it's not included as one of the options you can set via the Provisioner CRD?
> So the only option currently would be messing around with the UserData in the AWSNodeTemplate?

It's pretty trivial to plumb the config through the AWSNodeTemplate, but we'll probably wait on that until the EKS Optimized AMI supports the configuration, which will probably be when we migrate to AL2022. I don't believe Bottlerocket supports it either.

It's pretty easy to patch it within your user-data in the AWSNodeTemplate if you want to try it with Ubuntu or build your own AMI with an updated systemd.

I think this would work (although I haven't tested it yet):

apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: graceful-shutdown
spec:
  amiFamily: Ubuntu
  subnetSelector:
    karpenter.sh/discovery: my-cluster
  securityGroupSelector:
    karpenter.sh/discovery: my-cluster
  userData: |
    MIME-Version: 1.0
    Content-Type: multipart/mixed; boundary="BOUNDARY"

    --BOUNDARY
    Content-Type: text/x-shellscript; charset="us-ascii"

    #!/bin/bash
    echo "$(jq '.shutdownGracePeriod="2m"' /etc/kubernetes/kubelet/kubelet-config.json)" > /etc/kubernetes/kubelet/kubelet-config.json

    --BOUNDARY--

@ellistarn added the v1 label Apr 18, 2023
@sftim

sftim commented Jun 5, 2023

I think we'd also like to taint nodes the moment that Karpenter thinks those nodes have become eligible for consolidation.

This lets us quickly untaint them if we see unschedulable Pods that we think might fit there. Otherwise, we'd leave the taint in place through the node drain, shutdown and termination.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jan 30, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Feb 29, 2024
@jmdeal
Member

jmdeal commented Mar 12, 2024

/remove-lifecycle rotten

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jun 10, 2024
@jmdeal
Member

jmdeal commented Jun 13, 2024

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Jun 13, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Sep 11, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Oct 11, 2024