
Calico-node pods being recreated every few seconds #4538

Closed

maximemoreillon opened this issue May 30, 2024 · 3 comments

Comments


maximemoreillon commented May 30, 2024

Hello,

Thank you very much for Microk8s. It is an awesome project that I have been enjoying for several years now. However, I have recently been facing the following issue.

Summary

I am experiencing strange behavior with Calico v3.25.1 running on a 4-node Microk8s v1.30 cluster. The calico-node pod on each node keeps being terminated and recreated within at most 30 seconds:

NAME                                       READY   STATUS     RESTARTS        AGE
calico-kube-controllers-749f88db4d-rctkf   1/1     Running    0               4h28m
calico-node-2fxmj                          0/1     Init:0/2   0               3s
calico-node-8p84q                          0/1     Init:0/2   0               2s
calico-node-bl6hq                          0/1     Init:1/2   0               3s
calico-node-g2gsx                          0/1     Running    0               7s
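
For reference, the churn can be watched live with something along these lines (assuming the k8s-app=calico-node label used by the Microk8s Calico manifests and the kube-system namespace):

microk8s kubectl get pods -n kube-system -l k8s-app=calico-node -w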

The nodes report network unavailability:

  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   True    Thu, 30 May 2024 13:36:08 +0900   Thu, 30 May 2024 13:36:08 +0900   CalicoIsDown                 Calico is shutting down on this node
  MemoryPressure       False   Thu, 30 May 2024 13:33:10 +0900   Thu, 30 May 2024 02:36:59 +0900   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Thu, 30 May 2024 13:33:10 +0900   Thu, 30 May 2024 02:36:59 +0900   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Thu, 30 May 2024 13:33:10 +0900   Thu, 30 May 2024 02:36:59 +0900   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Thu, 30 May 2024 13:33:10 +0900   Thu, 30 May 2024 02:36:59 +0900   KubeletReady                 kubelet is posting ready status
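
These conditions can be checked on each node with something like:

microk8s kubectl describe node <node-name>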

Looking at the logs of the calico-node pods for the short time they are running does not reveal any obvious problem.
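
To grab those logs before the pods are replaced, a selector-based logs command can help (again assuming the k8s-app=calico-node label and a container named calico-node):

microk8s kubectl logs -n kube-system -l k8s-app=calico-node -c calico-node --tail=100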

What Should Happen Instead?

calico-node pods should be running continuously

Reproduction Steps

I had no such issues when using Microk8s v1.24. Since then, my changes to the cluster have been:

  • Incremental upgrade of Microk8s from v1.24 to v1.30
  • Installing Longhorn v1.6.1
  • Adding 10.1.0.0/16,10.152.183.0/24,*.svc,*.cluster.local to the NO_PROXY environment variable for Longhorn to work
  • Enabling hugepages on each node

Environment

  • Microk8s v1.30 (i.e. Kubernetes v1.30) with the DNS addon enabled
  • 4 node cluster composed of physical machines with Ubuntu 22.04
  • Calico version: 3.25.1

The cluster runs on-premise, behind a corporate proxy. Environment variables are set accordingly in /etc/environment on each node, as recommended here

HTTP_PROXY=http://172.16.105.13:8118
HTTPS_PROXY=http://172.16.105.13:8118
NO_PROXY=localhost,127.0.0.1,172.16.98.148,172.16.98.150,172.16.99.29,172.16.106.87,november,haspc,20khaa520,24khaa344,10.1.0.0/16,10.152.183.0/24,*.svc,*.cluster.local

Similarly, containerd environment variables are set in /var/snap/microk8s/current/args/containerd-env:

HTTPS_PROXY=http://172.16.105.13:8118
NO_PROXY=10.1.0.0/16,10.152.183.0/24,172.16.0.0/12,127.0.0.1,localhost

Probably unrelated, but Longhorn v1.6.1 is installed in the cluster using Helm.

I've got a similar cluster, composed of AWS EC2 instances, that does not require proxy settings, and I have no problems there.

Introspection Report

inspection-report-20240530_151748.tar.gz

Note: Inspection outputs the following error, as per #4361:

cp: cannot stat '/var/snap/microk8s/6876/var/kubernetes/backend/localnode.yaml': No such file or directory

Can you suggest a fix?

What I tried so far, unsuccessfully (example commands after the list):

  • restarting the calico-node daemonset
  • restarting the calico-kube-controllers deployment
  • restarting the nodes
  • updating Calico as described here
  • having each node leave and re-join the cluster, which fixed the problem for about a week, but I am now experiencing the issue again
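
For reference, the daemonset and deployment restarts above can be performed with the usual rollout commands, for example:

microk8s kubectl -n kube-system rollout restart daemonset/calico-node
microk8s kubectl -n kube-system rollout restart deployment/calico-kube-controllers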

Are you interested in contributing with a fix?

I'm afraid I lack the technical knowledge to do so


maximemoreillon commented May 31, 2024

I brought one node down for a reboot today for maintenance, and the calico-node pods started running properly.
Bringing the node back up immediately triggered the recreation cycle mentioned in this issue.

It appears that the calico-node pods only run properly if the one on the aforementioned node is in the Pending state.

I then tried to remove the node from the cluster entirely, which removed its calico-node pod and triggered the recreation cycle for the pods of the remaining nodes again.

maximemoreillon (Author) commented

I found a deployment whose pod had been stuck in the Pending state for a couple of months.
Deleting it immediately solved the problems with the calico-node pods.
As I didn't imagine those issues to be linked, I didn't check why the aforementioned pod was in the Pending state and am now unable to recreate the issue.
For anyone experiencing the same issue, check your pods and deal with any that are in the Pending state (a quick way to list them is shown below).
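
A quick way to list such pods across all namespaces is a field selector, e.g.:

microk8s kubectl get pods -A --field-selector=status.phase=Pending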

romainrossi commented

I had a similar problem (calico pods restarting again and again) and solved it with the same fix: deleting all pods in the Pending state.
Thank you for the hint @maximemoreillon !
