Tries to schedule on Fargate nodes #183
Comments
As an example, this is a cluster with 3 regular nodes and 2 Fargate nodes. It will never recover from this situation:
I saw that you mentioned, as part of aws-observability/helm-charts#41, that this breaks the add-on upgrade on your cluster.
Can you elaborate? Did you see the add-on in a "Degraded" status when you tried upgrading to 1.7.0?
It actually goes to a "Failed" state on the AWS overview page (the add-on tab in the cluster details). On the cluster itself, the rollout never completes, because the pods cannot run on the Fargate nodes; see the kubectl output in the comment above: 2 pending and 3 running, for both fluent-bit and cloudwatch-agent. Removing the add-on and re-installing version 1.6.0 fixes this, and that is how I solved it for now.
We are having the same issue with version 1.7.0, and we partially worked around it by setting `tolerations: []` in the add-on configuration (sketched below). However, this only keeps the fluent-bit pods off the Fargate nodes, not the cloudwatch-agent ones. The result is the following:
EDIT: solved it by uninstalling and reinstalling the add-on, as suggested by @wonko.
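For reference, the workaround described above amounts to overriding the pod tolerations through the add-on's advanced configuration. The thread does not show the configuration schema, so the snippet below is only a minimal sketch assuming a top-level `tolerations` key that the add-on passes through to the fluent-bit DaemonSet:

```yaml
# Hypothetical advanced-configuration values for the amazon-cloudwatch-observability
# add-on. The top-level "tolerations" key is assumed from the comment above,
# not taken from the add-on's published schema.
# An empty list removes the default catch-all toleration, so the scheduler
# again respects the eks.amazonaws.com/compute-type=fargate:NoSchedule taint.
tolerations: []
```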
Hello, this issue has been fixed and released as part of the amazon-cloudwatch-observability Helm chart v1.9.0.
In aws-observability/helm-charts#41 a default "tolerate everything" toleration was added to all DaemonSets. This results in DaemonSets that can never roll out completely, because their pods will never run on the Fargate nodes: those nodes carry the taint `eks.amazonaws.com/compute-type=fargate:NoSchedule`, but the overly liberal default toleration ignores it. The net effect is an add-on upgrade that never finishes (the DaemonSet always has pending pods).
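To make the failure mode concrete, here is a minimal sketch (not copied from the chart) of the kind of blanket toleration involved and why it defeats the Fargate taint:

```yaml
# Fargate nodes carry this taint, which normally keeps DaemonSet pods away:
#   eks.amazonaws.com/compute-type=fargate:NoSchedule
#
# A catch-all toleration of the kind described above (illustrative form only).
# "operator: Exists" with no key tolerates every taint, including the Fargate
# one, so the DaemonSet controller also creates pods for the Fargate nodes,
# where they can never start and sit in Pending forever.
tolerations:
  - operator: Exists
```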
I believe the operator should set correct tolerations (or the default tolerations in the Helm charts should be sensible defaults for a normal EKS cluster; in that case this issue should move to https://github.com/aws-observability/helm-charts).
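One possible shape for such defaults (my sketch, not necessarily what later chart versions shipped): keep a toleration for tainted EC2 nodes, but steer the DaemonSets away from Fargate with a node-affinity rule on the `eks.amazonaws.com/compute-type` label, which Fargate nodes are assumed to carry with the value `fargate`:

```yaml
# Sketch of a DaemonSet pod-spec fragment that tolerates tainted EC2 nodes
# but never targets Fargate nodes.
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: eks.amazonaws.com/compute-type
                operator: NotIn        # also matches nodes without the label
                values:
                  - fargate
  tolerations:
    - operator: Exists                 # still run on otherwise-tainted EC2 nodes
```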