
Calico-node pods being recreated every few seconds #4538

Closed

maximemoreillon opened this issue May 30, 2024 · 3 comments

Comments


maximemoreillon commented May 30, 2024

Hello,

Thank you very much for Microk8s. It is an awesome project that I have been enjoying for several years now. However, I have recently been facing the following issue.

Summary

I am experiencing strange behavior with Calico v3.25.1 running on a 4-node Microk8s v1.30 cluster. The calico-node pod on each node keeps being terminated and recreated within at most 30 seconds:

NAME                                       READY   STATUS     RESTARTS        AGE
calico-kube-controllers-749f88db4d-rctkf   1/1     Running    0               4h28m
calico-node-2fxmj                          0/1     Init:0/2   0               3s
calico-node-8p84q                          0/1     Init:0/2   0               2s
calico-node-bl6hq                          0/1     Init:1/2   0               3s
calico-node-g2gsx                          0/1     Running    0               7s
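
For reference, the churn can be watched live with something along these lines (assuming the k8s-app=calico-node label used by the Microk8s Calico manifests and the kube-system namespace):

microk8s kubectl get pods -n kube-system -l k8s-app=calico-node -w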

The nodes report network unavailability:

  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   True    Thu, 30 May 2024 13:36:08 +0900   Thu, 30 May 2024 13:36:08 +0900   CalicoIsDown                 Calico is shutting down on this node
  MemoryPressure       False   Thu, 30 May 2024 13:33:10 +0900   Thu, 30 May 2024 02:36:59 +0900   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Thu, 30 May 2024 13:33:10 +0900   Thu, 30 May 2024 02:36:59 +0900   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Thu, 30 May 2024 13:33:10 +0900   Thu, 30 May 2024 02:36:59 +0900   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Thu, 30 May 2024 13:33:10 +0900   Thu, 30 May 2024 02:36:59 +0900   KubeletReady                 kubelet is posting ready status
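
These conditions can be checked on each node with something like:

microk8s kubectl describe node <node-name>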

Looking at the logs of the calico-node pods for the short time they are running does not reveal any obvious problem.
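
To grab those logs before the pods are replaced, a selector-based logs command can help (again assuming the k8s-app=calico-node label and a container named calico-node):

microk8s kubectl logs -n kube-system -l k8s-app=calico-node -c calico-node --tail=100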

What Should Happen Instead?

calico-node pods should be running continuously

Reproduction Steps

I had no such issues when using Microk8s v1.24. Since then, my changes to the cluster have been:

  • Incremental upgrade of Microk8s from v1.24 to v1.30
  • Installing Longhorn v1.6.1
  • Adding 10.1.0.0/16,10.152.183.0/24,*.svc,*.cluster.local to the NO_PROXY environment variable for Longhorn to work
  • Enabling hugepages on each node

Environment

  • Microk8s v1.30 (i.e. Kubernetes v1.30) with the DNS addon enabled
  • 4 node cluster composed of physical machines with Ubuntu 22.04
  • Calico version: 3.25.1

The cluster runs on-premise, behind a corporate proxy. Environment variables are set accordingly in /etc/environment on each node, as recommended here

HTTP_PROXY=http://172.16.105.13:8118
HTTPS_PROXY=http://172.16.105.13:8118
NO_PROXY=localhost,127.0.0.1,172.16.98.148,172.16.98.150,172.16.99.29,172.16.106.87,november,haspc,20khaa520,24khaa344,10.1.0.0/16,10.152.183.0/24,*.svc,*.cluster.local

Similarly, containerd environment variables are set in /var/snap/microk8s/current/args/containerd-env:

HTTPS_PROXY=http://172.16.105.13:8118
NO_PROXY=10.1.0.0/16,10.152.183.0/24,172.16.0.0/12,127.0.0.1,localhost

Probably unrelated, but Longhorn v1.6.1 is installed in the cluster using Helm.

I've got a similar cluster, composed of AWS EC2 instances, that does not require proxy settings, and I have no problems there.

Introspection Report

inspection-report-20240530_151748.tar.gz

Note: Inspection outputs the following error, as per #4361:

cp: cannot stat '/var/snap/microk8s/6876/var/kubernetes/backend/localnode.yaml': No such file or directory

Can you suggest a fix?

What I tried so far, unsuccessfully (example commands after the list):

  • restarting the calico-node daemonset
  • restarting the calico-kube-controllers deployment
  • restarting the nodes
  • updating Calico as described here
  • having each node leave and re-join the cluster, which fixed the problem for about a week, but I am now experiencing the issue again
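
For reference, the daemonset and deployment restarts above can be performed with the usual rollout commands, for example:

microk8s kubectl -n kube-system rollout restart daemonset/calico-node
microk8s kubectl -n kube-system rollout restart deployment/calico-kube-controllers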

Are you interested in contributing with a fix?

I'm afraid I lack the technical knowledge to do so


maximemoreillon commented May 31, 2024

I brought one node down for a reboot today for maintenance, and the calico-node pods started running properly.
Bringing the node back up immediately triggered the recreation cycle mentioned in this issue.

It appears that the calico-node pods only run properly if the one on the aforementioned node is in the Pending state.

I then tried to remove the node from the cluster entirely, which removed its calico-node pod and triggered the recreation cycle for the pods of the remaining nodes again.

maximemoreillon (Author) commented

I found a deployment whose pod had been stuck in the Pending state for a couple of months.
Deleting it immediately solved the problems with the calico-node pods.
As I didn't imagine those issues to be linked, I didn't check why the aforementioned pod was in the Pending state and am now unable to recreate the issue.
For anyone experiencing the same issue, check your pods and deal with any that are in the Pending state (a quick way to list them is shown below).
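
A quick way to list such pods across all namespaces is a field selector, e.g.:

microk8s kubectl get pods -A --field-selector=status.phase=Pending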

romainrossi commented

I had a similar problem (calico pods restarting again and again) and solved it with the same fix: deleting all pods in the Pending state.
Thank you for the hint @maximemoreillon !
