
custom kubelet dir is not cleaned up on controller+worker nodes on reset #986

@byDimasik

Description

Versions

k0sctl version: github.com/k0sproject/k0sctl v0.26.1-0.20251016074538-d8718bed3a0b
k0s version: v1.32.6+k0s.0

Context

This must be a regression from #904 and #892.

k0sctl is used as a vendored dependency in Go code.

What happened

After redeploying a k0s cluster on nodes where it had previously been deployed and then reset with k0sctl reset, I am unable to get the logs of certain pods:

% k logs -n mke mke-operator-controller-manager-67b5c65c9-j6pfr
Error from server: Get "https://172.31.0.191:10250/containerLogs/mke/mke-operator-controller-manager-67b5c65c9-j6pfr/manager": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes-ca")

I have 2 nodes in my cluster: 1 controller+worker and 1 worker. Both nodes use a custom kubelet dir /var/lib/kubelet.

The error above only appears for pods that are running on the controller node. Pods from the worker node show logs just fine.

After investigating, I found that the kubelet server cert doesn't match the cluster CA:

# openssl verify -CAfile /var/lib/k0s/pki/ca.crt /var/lib/kubelet/pki/kubelet-server-current.pem 
O = system:nodes, CN = system:node:ip-172-31-0-73.ec2.internal
error 30 at 0 depth lookup: authority and subject key identifier mismatch
error /var/lib/kubelet/pki/kubelet-server-current.pem: verification failed

The same check passes on the worker node:

# openssl verify -CAfile /var/lib/k0s/pki/ca.crt /var/lib/kubelet/pki/kubelet-server-current.pem 
/var/lib/kubelet/pki/kubelet-server-current.pem: OK
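
The same check can also be done programmatically. Below is a minimal Go sketch (not part of k0sctl; paths are taken from the openssl commands above) that verifies kubelet-server-current.pem against the cluster CA:

package main

import (
    "crypto/x509"
    "encoding/pem"
    "fmt"
    "log"
    "os"
)

func main() {
    // Paths taken from the openssl checks above.
    caPEM, err := os.ReadFile("/var/lib/k0s/pki/ca.crt")
    if err != nil {
        log.Fatal(err)
    }
    roots := x509.NewCertPool()
    if !roots.AppendCertsFromPEM(caPEM) {
        log.Fatal("no CA certificate found in ca.crt")
    }

    certPEM, err := os.ReadFile("/var/lib/kubelet/pki/kubelet-server-current.pem")
    if err != nil {
        log.Fatal(err)
    }
    block, _ := pem.Decode(certPEM)
    if block == nil {
        log.Fatal("failed to decode kubelet server certificate PEM")
    }
    cert, err := x509.ParseCertificate(block.Bytes)
    if err != nil {
        log.Fatal(err)
    }

    // ExtKeyUsageAny mirrors a plain `openssl verify`, which only checks the chain.
    _, err = cert.Verify(x509.VerifyOptions{
        Roots:     roots,
        KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageAny},
    })
    if err != nil {
        fmt.Println("verification failed:", err)
        return
    }
    fmt.Println("OK")
}

Given the openssl results above, this fails on the controller+worker node and prints OK on the worker node.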

After inspecting the kubelet PKI dir, I found that on the controller+worker node, the kubelet server cert was from the previous k0s installation:

# ls -al /var/lib/kubelet/pki/
total 28
drwxr-xr-x 2 root root 4096 Dec  2 18:35 .
drwxr-xr-x 9 root root 4096 Nov 21 02:58 ..
-rw------- 1 root root 1143 Nov 21 02:58 kubelet-client-2025-11-21-02-58-35.pem
-rw------- 1 root root 1143 Nov 22 03:02 kubelet-client-2025-11-22-03-02-38.pem
-rw------- 1 root root 1143 Dec  2 16:56 kubelet-client-2025-12-02-16-56-03.pem
-rw------- 1 root root 1143 Dec  2 18:35 kubelet-client-2025-12-02-18-35-41.pem
lrwxrwxrwx 1 root root   59 Dec  2 18:35 kubelet-client-current.pem -> /var/lib/kubelet/pki/kubelet-client-2025-12-02-18-35-41.pem
-rw------- 1 root root 1208 Nov 21 02:58 kubelet-server-2025-11-21-02-58-40.pem
lrwxrwxrwx 1 root root   59 Nov 21 02:58 kubelet-server-current.pem -> /var/lib/kubelet/pki/kubelet-server-2025-11-21-02-58-40.pem

Here, the client cert was reissued and the symlink was updated, but the server cert was reused from the old deployment.
On the worker node the same directory looks like this:

# ls -al /var/lib/kubelet/pki/
total 16
drwxr-xr-x 2 root root 4096 Dec  2 16:56 .
drwxr-xr-x 9 root root 4096 Dec  2 16:56 ..
-rw------- 1 root root 1143 Dec  2 16:56 kubelet-client-2025-12-02-16-56-05.pem
lrwxrwxrwx 1 root root   59 Dec  2 16:56 kubelet-client-current.pem -> /var/lib/kubelet/pki/kubelet-client-2025-12-02-16-56-05.pem
-rw------- 1 root root 1208 Dec  2 16:56 kubelet-server-2025-12-02-16-56-13.pem
lrwxrwxrwx 1 root root   59 Dec  2 16:56 kubelet-server-current.pem -> /var/lib/kubelet/pki/kubelet-server-2025-12-02-16-56-13.pem

Somehow, k0sctl reset cleaned up the custom kubelet dir on the worker node but not on the controller+worker node. As a result, the subsequent k0s installation reused the leftover kubelet server cert on the controller+worker node.

The systemd units on both nodes include the custom kubelet dir flag:

k0scontroller service
[Unit]
Description=k0s - Zero Friction Kubernetes
Documentation=https://docs.k0sproject.io
ConditionFileIsExecutable=/usr/local/bin/k0s

After=network-online.target 
Wants=network-online.target 

[Service]
StartLimitInterval=5
StartLimitBurst=10
ExecStart=/usr/local/bin/k0s controller --config=/etc/k0s/k0s.yaml --data-dir=/var/lib/k0s --debug=true --disable-components=konnectivity-server,endpoint-reconciler --enable-metrics-scraper=true --enable-worker=true --kubelet-extra-args=--node-ip=172.31.0.73 --kubelet-root-dir=/var/lib/kubelet --labels=mke/version=dev --profile=mke-default-manager

RestartSec=10
Delegate=yes
KillMode=process
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
LimitNOFILE=999999
Restart=always

[Install]
WantedBy=multi-user.target

k0sworker service
[Unit]
Description=k0s - Zero Friction Kubernetes
Documentation=https://docs.k0sproject.io
ConditionFileIsExecutable=/usr/local/bin/k0s

After=network-online.target 
Wants=network-online.target 

[Service]
StartLimitInterval=5
StartLimitBurst=10
ExecStart=/usr/local/bin/k0s worker --data-dir=/var/lib/k0s --debug=true --kubelet-extra-args=--node-ip=172.31.0.48 --kubelet-root-dir=/var/lib/kubelet --profile=mke-default-worker --token-file=/etc/k0s/k0stoken

RestartSec=10
Delegate=yes
KillMode=process
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
LimitNOFILE=999999
Restart=always

[Install]
WantedBy=multi-user.target

Steps to reproduce

  1. Deploy a k0s cluster with 1 controller+worker node and 1 worker node. Set --kubelet-root-dir=/var/lib/kubelet for each node.

Use vendored k0sctl in Go code as shown in #904.

Example k0sctl cluster config
apiVersion: k0sctl.k0sproject.io/v1beta1
kind: Cluster
metadata:
  name: mke
  user: ""
spec:
  hosts:
  - ssh:
      address: 3.237.0.80
      user: ubuntu
      port: 22
      keyPath: /Users/dshishliannikov/mirantis/mke/deployments/mke3/ssh_keys/mke3.pem
    role: controller+worker
    installFlags:
    - --kubelet-root-dir=/var/lib/kubelet
    - --data-dir=/var/lib/k0s
    - --debug=true
    - --enable-metrics-scraper=true
    - --disable-components=konnectivity-server,endpoint-reconciler
    - --labels=mke/version=dev
    - --profile=mke-default-manager
  - ssh:
      address: 3.227.235.72
      user: ubuntu
      port: 22
      keyPath: /Users/dshishliannikov/mirantis/mke/deployments/mke3/ssh_keys/mke3.pem
    role: worker
    installFlags:
    - --debug=true
    - --kubelet-root-dir=/var/lib/kubelet
    - --data-dir=/var/lib/k0s
    - --profile=mke-default-worker
  k0s:
    version: v1.32.6+k0s.0
    config:
      apiVersion: k0s.k0sproject.io/v1beta1
      kind: Cluster
      metadata:
        name: mke
      spec:
        api:
          externalAddress: 80xa1p-mke4-lb-6972b00c98d1bfe2.elb.us-east-1.amazonaws.com
          extraArgs:
            authentication-config: /var/lib/k0s/oidc-config.yaml
            encryption-provider-config: /var/lib/k0s/encryption.cfg
            profiling: "false"
            request-timeout: 1m0s
            service-node-port-range: 32768-35535
          sans:
          - 80xa1p-mke4-lb-6972b00c98d1bfe2.elb.us-east-1.amazonaws.com
        controllerManager:
          extraArgs:
            profiling: "false"
            terminated-pod-gc-threshold: "12500"
        extensions:
          helm:
            charts:
            - chartname: oci://ghcr.io/mirantiscontainers/mke4-ucpauthz
              name: ucpauthz
              namespace: mke
              order: 3
              timeout: 10m0s
              values: "disabled: false\nexempt:\n  namespaces:\n  \n  users:\n  \n
                \   - system:serviceaccount:calico-apiserver:calico-apiserver\n  \n
                \   - system:serviceaccount:calico-system:calico-cni-plugin\n  \n
                \   - system:serviceaccount:calico-system:calico-kube-controllers\n
                \ \n    - system:serviceaccount:calico-system:calico-node\n  \n    -
                system:serviceaccount:calico-system:calico-typha\n  \n    - system:serviceaccount:calico-system:csi-node-driver\n
                \ \n    - system:serviceaccount:calico-system:default\n  \n    - system:serviceaccount:tigera-operator:default\n
                \ \n    - system:serviceaccount:calico-apiserver:default\n  \n    -
                system:serviceaccount:tigera-operator:tigera-operator\n  "
              version: 0.1.0
            - chartname: oci://ghcr.io/mirantiscontainers/mke4-tigera-operator-crds
              name: mke4-tigera-operator-crds
              namespace: tigera-operator
              order: 4
              timeout: 10m0s
              version: v3.30.200
            - chartname: oci://ghcr.io/mirantiscontainers/mke4-tigera-operator
              name: tigera-operator
              namespace: tigera-operator
              order: 4
              timeout: 10m0s
              values: |-
                kubeletVolumePluginPath: /var/lib/kubelet
                installation:
                  registry: ghcr.io/mirantiscontainers/
                  logging:
                    cni:
                      logSeverity: Info
                  cni:
                    type: Calico
                  kubeletVolumePluginPath: /var/lib/kubelet
                  calicoNetwork:
                    linuxDataplane: Iptables
                    ipPools:
                    - cidr: 192.168.0.0/16
                      encapsulation: VXLAN
                      blockSize: 26
                resources:
                  requests:
                    cpu: 250m
                tigeraOperator:
                  version: v1.38.3
                  registry: ghcr.io/mirantiscontainers/
                defaultFelixConfiguration:
                  enabled: true
                  logSeveritySys: Info
                  ipsecLogLevel: Info
                  bpfLogLevel: Info
                  vxlanPort: 4789
                  vxlanVNI: 10000
              version: v3.30.200
            - chartname: oci://registry.mirantis.com/k0rdent-enterprise/charts/k0rdent-enterprise
              name: kcm
              namespace: k0rdent
              order: 6
              timeout: 10m0s
              values: |
                {"velero":{"enabled":false,"image":{"repository":"registry.mirantis.com/k0rdent-enterprise/velero/velero"}},"cert-manager":{"clusterResourceNamespace":"mke","image":{"repository":"registry.mirantis.com/k0rdent-enterprise/jetstack/cert-manager-controller"},"webhook":{"image":{"repository":"registry.mirantis.com/k0rdent-enterprise/jetstack/cert-manager-webhook"},"tolerations":[{"key":"node-role.kubernetes.io/master","operator":"Exists","effect":"NoSchedule"}]},"cainjector":{"image":{"repository":"registry.mirantis.com/k0rdent-enterprise/jetstack/cert-manager-cainjector"},"tolerations":[{"key":"node-role.kubernetes.io/master","operator":"Exists","effect":"NoSchedule"}]},"startupapicheck":{"image":{"repository":"registry.mirantis.com/k0rdent-enterprise/jetstack/cert-manager-startupapicheck"},"tolerations":[{"key":"node-role.kubernetes.io/master","operator":"Exists","effect":"NoSchedule"}]},"tolerations":[{"key":"node-role.kubernetes.io/master","operator":"Exists","effect":"NoSchedule"}]},"controller":{"templatesRepoURL":"oci://registry.mirantis.com/k0rdent-enterprise/charts","globalRegistry":"registry.mirantis.com/k0rdent-enterprise","tolerations":[{"key":"node-role.kubernetes.io/master","operator":"Exists","effect":"NoSchedule"}]},"image":{"repository":"registry.mirantis.com/k0rdent-enterprise/kcm-controller"},"flux2":{"helmController":{"image":"registry.mirantis.com/k0rdent-enterprise/fluxcd/helm-controller","tolerations":[{"key":"node-role.kubernetes.io/master","operator":"Exists","effect":"NoSchedule"}]},"sourceController":{"image":"registry.mirantis.com/k0rdent-enterprise/fluxcd/source-controller","tolerations":[{"key":"node-role.kubernetes.io/master","operator":"Exists","effect":"NoSchedule"}]},"cli":{"image":"registry.mirantis.com/k0rdent-enterprise/fluxcd/flux-cli","tolerations":[{"key":"node-role.kubernetes.io/master","operator":"Exists","effect":"NoSchedule"}]}},"cluster-api-operator":{"image":{"manager":{"repository":"registry.mirantis.com/k0rdent-enterprise/capi-operator/cluster-api-operator"}}},"k0rdent-ui":{"enabled":true,"image":{"repository":"registry.mirantis.com/k0rdent-enterprise/k0rdent-ui"},"tolerations":[{"key":"node-role.kubernetes.io/master","operator":"Exists","effect":"NoSchedule"}]}}
              version: 1.1.0
        images:
          calico:
            cni:
              image: quay.io/k0sproject/calico-cni
              version: v3.29.4-0
            kubecontrollers:
              image: quay.io/k0sproject/calico-kube-controllers
              version: v3.29.4-0
            node:
              image: quay.io/k0sproject/calico-node
              version: v3.29.4-0
          coredns:
            image: quay.io/k0sproject/coredns
            version: 1.12.2
          default_pull_policy: IfNotPresent
          konnectivity:
            image: quay.io/k0sproject/apiserver-network-proxy-agent
            version: v0.31.0
          kubeproxy:
            image: quay.io/k0sproject/kube-proxy
            version: v1.32.6
          kuberouter:
            cni:
              image: quay.io/k0sproject/kube-router
              version: v2.4.1-iptables1.8.9-0
            cniInstaller:
              image: quay.io/k0sproject/cni-node
              version: 1.3.0-k0s.0
          metricsserver:
            image: registry.k8s.io/metrics-server/metrics-server
            version: v0.7.2
          pause:
            image: registry.k8s.io/pause
            version: "3.9"
          pushgateway:
            image: quay.io/k0sproject/pushgateway-ttl
            version: 1.4.0-k0s.0
          repository: ghcr.io/mirantiscontainers
        network:
          clusterDomain: cluster.local
          controlPlaneLoadBalancing:
            enabled: false
          dualStack:
            enabled: false
          kubeProxy:
            iptables:
              minSyncPeriod: 0s
              syncPeriod: 0s
            ipvs:
              minSyncPeriod: 0s
              syncPeriod: 0s
              tcpFinTimeout: 0s
              tcpTimeout: 0s
              udpTimeout: 0s
            metricsBindAddress: 0.0.0.0:10249
            mode: iptables
            nftables:
              minSyncPeriod: 0s
              syncPeriod: 0s
          kuberouter:
            autoMTU: true
            hairpin: Enabled
            metricsPort: 8080
          nodeLocalLoadBalancing:
            enabled: false
            envoyProxy:
              apiServerBindPort: 7443
              image:
                image: quay.io/k0sproject/envoy-distroless
                version: v1.31.5
              konnectivityServerBindPort: 7132
            type: EnvoyProxy
          podCIDR: 192.168.0.0/16
          provider: custom
          serviceCIDR: 10.96.0.0/16
        scheduler:
          extraArgs:
            bind-address: 127.0.0.1
            profiling: "false"
        storage:
          etcd: {}
          type: etcd
        telemetry:
          enabled: true
        workerProfiles:
        - name: mke-default-worker
          values:
            eventRecordQPS: 50
            kubeReserved:
              cpu: 50m
              ephemeral-storage: 500Mi
              memory: 300Mi
            maxPods: 110
            podPidsLimit: -1
            podsPerCore: 0
            protectKernelDefaults: false
            seccompDefault: false
        - name: mke-default-manager
          values:
            eventRecordQPS: 50
            kubeReserved:
              cpu: 250m
              ephemeral-storage: 4Gi
              memory: 2Gi
            maxPods: 110
            podPidsLimit: -1
            podsPerCore: 0
            protectKernelDefaults: false
            seccompDefault: false
  options:
    wait:
      enabled: false
    drain:
      enabled: false
      gracePeriod: 0s
      timeout: 0s
      force: false
      ignoreDaemonSets: false
      deleteEmptyDirData: false
      podSelector: ""
      skipWaitForDeleteTimeout: 0s
    concurrency:
      limit: 0
      workerDisruptionPercent: 0
      uploads: 0
    evictTaint:
      enabled: false
      taint: ""
      effect: ""
      controllerWorkers: false

  2. Reset the cluster with k0sctl reset.
  3. Inspect /var/lib/kubelet on every node.

The controller+worker node has the dir present with all the files:

# ls -al /var/lib/kubelet
total 44
drwxr-xr-x  9 root root 4096 Nov 21 02:58 .
drwxr-xr-x 43 root root 4096 Dec  2 18:53 ..
drwx------  2 root root 4096 Nov 21 02:58 checkpoints
-rw-------  1 root root   62 Nov 21 02:58 cpu_manager_state
drwxr-xr-x  2 root root 4096 Dec  2 18:35 device-plugins
-rw-------  1 root root   61 Nov 21 02:58 memory_manager_state
drwxr-xr-x  2 root root 4096 Dec  2 18:35 pki
drwxr-x---  3 root root 4096 Nov 21 02:58 plugins
drwxr-x---  2 root root 4096 Dec  2 18:36 plugins_registry
drwxr-x---  2 root root 4096 Dec  2 18:35 pod-resources
drwxr-x--- 39 root root 4096 Dec  2 18:43 pods

The worker node doesn't have the dir:

#  ls -al /var/lib/kubelet
ls: cannot access '/var/lib/kubelet': No such file or directory

In the reset logs, I can see that the k0s reset command for the controller doesn't include the --kubelet-root-dir flag, while the same command for the worker node does include it:

time="2025-12-02T13:52:55-05:00" level=info msg="==> Running phase: Reset workers"
...
time="2025-12-02T13:52:55-05:00" level=debug msg="[ssh] 3.227.235.72:22: resetting k0s..."
time="2025-12-02T13:52:55-05:00" level=debug msg="[ssh] 3.227.235.72:22: executing `sudo -- /usr/local/bin/k0s reset --data-dir=/var/lib/k0s --kubelet-root-dir=/var/lib/kubelet`"
...
time="2025-12-02T13:53:08-05:00" level=info msg="==> Running phase: Reset controllers"
...
time="2025-12-02T13:53:15-05:00" level=debug msg="[ssh] 3.237.0.80:22: resetting k0s..."
time="2025-12-02T13:53:15-05:00" level=debug msg="[ssh] 3.237.0.80:22: executing `sudo -- /usr/local/bin/k0s reset --data-dir=/var/lib/k0s`"
...
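
So the worker reset carries --kubelet-root-dir over from the install flags, while the controller reset only gets --data-dir. For illustration, a hypothetical Go sketch of the kind of flag pass-through I would expect on reset (the function and variable names are made up, not k0sctl's actual API):

package main

import (
    "fmt"
    "strings"
)

// resetArgs builds the argument list for `k0s reset`, copying --data-dir and
// --kubelet-root-dir over from the host's install flags so that a custom
// kubelet dir is cleaned up on controller+worker nodes too.
// Illustrative only; not k0sctl's actual code.
func resetArgs(installFlags []string) []string {
    args := []string{"reset"}
    for _, f := range installFlags {
        if strings.HasPrefix(f, "--data-dir=") || strings.HasPrefix(f, "--kubelet-root-dir=") {
            args = append(args, f)
        }
    }
    return args
}

func main() {
    // Roughly the flags from the k0scontroller unit above.
    flags := []string{
        "--kubelet-root-dir=/var/lib/kubelet",
        "--data-dir=/var/lib/k0s",
        "--enable-worker=true",
    }
    fmt.Println("k0s", strings.Join(resetArgs(flags), " "))
    // Prints: k0s reset --kubelet-root-dir=/var/lib/kubelet --data-dir=/var/lib/k0s
}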
