Skip to content

Commit

Permalink
doc(upgrade): add known issue for v1.3.1 to v1.3.2 upgrades (#635)
Browse files Browse the repository at this point in the history
For two-node cluster upgrades, the RKE2's agent load balancer on the
worker node will stop functioning when the management node's RKE2 is
under upgrade. This commit describes how to triage and recover from
this upgrade incident.

Signed-off-by: Zespre Chang <zespre.chang@suse.com>
Co-authored-by: Jillian <67180770+jillian-maroket@users.noreply.github.com>
  • Loading branch information
starbops and jillian-maroket committed Sep 5, 2024
1 parent ce71d09 commit 1de2009
Show file tree
Hide file tree
Showing 10 changed files with 452 additions and 8 deletions.
2 changes: 1 addition & 1 deletion docs/upgrade/v1-1-2-to-v1-2-0.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
---
sidebar_position: 5
sidebar_position: 6
sidebar_label: Upgrade from v1.1.2 to v1.2.0 (not recommended)
title: "Upgrade from v1.1.2 to v1.2.0 (not recommended)"
---
Expand Down
2 changes: 1 addition & 1 deletion docs/upgrade/v1-2-0-to-v1-2-1.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
---
sidebar_position: 4
sidebar_position: 5
sidebar_label: Upgrade from v1.1.2/v1.1.3/v1.2.0 to v1.2.1
title: "Upgrade from v1.1.2/v1.1.3/v1.2.0 to v1.2.1"
---
Expand Down
2 changes: 1 addition & 1 deletion docs/upgrade/v1-2-1-to-v1-2-2.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
---
sidebar_position: 3
sidebar_position: 4
sidebar_label: Upgrade from v1.2.1 to v1.2.2
title: "Upgrade from v1.2.1 to v1.2.2"
---
Expand Down
2 changes: 1 addition & 1 deletion docs/upgrade/v1-2-2-to-v1-3-1.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
---
sidebar_position: 2
sidebar_position: 3
sidebar_label: Upgrade from v1.2.2/v1.3.0 to v1.3.1
title: "Upgrade from v1.2.2/v1.3.0 to v1.3.1"
---
Expand Down
222 changes: 222 additions & 0 deletions docs/upgrade/v1-3-1-to-v1-3-2.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,222 @@
---
sidebar_position: 2
sidebar_label: Upgrade from v1.3.1 to v1.3.2
title: "Upgrade from v1.3.1 to v1.3.2"
---

<head>
<link rel="canonical" href="https://docs.harvesterhci.io/v1.3/upgrade/v1-3-1-to-v1-3-2"/>
</head>

## General information

An **Upgrade** button appears on the **Dashboard** screen whenever a new Harvester version that you can upgrade to becomes available. For more information, see [Start an upgrade](./automatic.md#start-an-upgrade).

For air-gapped environments, see [Prepare an air-gapped upgrade](./automatic.md#prepare-an-air-gapped-upgrade).


## Known issues

---

### 1. Two-node cluster upgrade stuck after the first node is pre-drained

:::info important

Shut down all workload VMs before upgrading **two-node clusters** to prevent data loss.

:::

The worker node can falsely transition to a not-ready state when RKE2 is upgraded on the management node. Consequently, the existing pods on the worker node are evicted and new pods cannot be scheduled on any nodes. These ultimately cause a chained failure in the whole cluster and prevent completion of the upgrade process.

Check the cluster status when the following occur:

- The upgrade process becomes stuck for some time.
- You are unable to access the Harvester UI and receive an HTTP 503 error.

1. Check the conditions and node statuses of the latest `Upgrade` custom resource.

Proceed to the next step if the following conditions are met:

- `SystemServicesUpgraded` is set to `True`, indicating that the system services upgrade is completed.
- In `nodeStatuses`, the state of the management node is either `Pre-drained` or `Waiting Reboot`.
- In `nodeStatuses`, the state of the worker node is `Images preloaded`.

Example:

```
# Find out the latest Upgrade custom resource
$ kubectl -n harvester-system get upgrades.harvesterhci -l harvesterhci.io/latestUpgrade=true
NAME AGE
hvst-upgrade-szlg8 48m
# Check the conditions and node statuses
$ kubectl -n harvester-system get upgrades hvst-upgrade-szlg8 -o yaml
apiVersion: harvesterhci.io/v1beta1
kind: Upgrade
metadata:
...
labels:
harvesterhci.io/latestUpgrade: "true"
harvesterhci.io/upgradeState: UpgradingNodes
name: hvst-upgrade-szlg8
namespace: harvester-system
...
spec:
image: ""
logEnabled: false
version: v1.3.2-rc2
status:
conditions:
- status: Unknown
type: Completed
- lastUpdateTime: "2024-09-02T11:57:04Z"
message: Upgrade observability is administratively disabled
reason: Disabled
status: "False"
type: LogReady
- lastUpdateTime: "2024-09-02T11:58:01Z"
status: "True"
type: ImageReady
- lastUpdateTime: "2024-09-02T12:02:31Z"
status: "True"
type: RepoReady
- lastUpdateTime: "2024-09-02T12:18:44Z"
status: "True"
type: NodesPrepared
- lastUpdateTime: "2024-09-02T12:31:25Z"
status: "True"
type: SystemServicesUpgraded
- status: Unknown
type: NodesUpgraded
imageID: harvester-system/hvst-upgrade-szlg8
nodeStatuses:
harvester-c6phd:
state: Pre-drained
harvester-jkqhq:
state: Images preloaded
previousVersion: v1.3.1
...
```

2. Check the node status.

Proceed to the next step if the following conditions are met:

- The status of the worker node is `NotReady`.
- The status of the management node is `Ready,SchedulingDisabled`.

Example:

```
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
harvester-c6phd Ready,SchedulingDisabled control-plane,etcd,master 174m v1.28.12+rke2r1
harvester-jkqhq NotReady <none> 166m v1.27.13+rke2r1
```

3. Check the pods on the worker node.

The issue exists in the cluster if the status of most pods is `Terminating`.

Example:

```
# Assume harvester-jkqhq is the worker node
$ kubectl get pods -A --field-selector spec.nodeName=harvester-jkqhq
NAMESPACE NAME READY STATUS RESTARTS AGE
cattle-fleet-local-system fleet-agent-6779fb5dd9-dkpjz 1/1 Terminating 0 18m
cattle-fleet-system fleet-agent-86db8d9954-qgcpq 1/1 Terminating 2 (18m ago) 61m
cattle-fleet-system fleet-controller-696d4b8878-ddctd 1/1 Terminating 1 (19m ago) 29m
cattle-fleet-system gitjob-694dd97686-s4z68 1/1 Terminating 1 (19m ago) 29m
cattle-provisioning-capi-system capi-controller-manager-6f497d5574-wkrnf 1/1 Terminating 0 20m
cattle-system cattle-cluster-agent-76db9cf9fc-5hhsx 1/1 Terminating 0 20m
cattle-system cattle-cluster-agent-76db9cf9fc-dnr6m 1/1 Terminating 0 20m
cattle-system harvester-cluster-repo-7458c7c69d-p982g 1/1 Terminating 0 27m
cattle-system rancher-7d65df9bd4-77n7w 1/1 Terminating 0 31m
cattle-system rancher-webhook-cfc66d5d7-fd6gm 1/1 Terminating 0 28m
harvester-system harvester-85ff674986-wxkl4 1/1 Terminating 0 26m
harvester-system harvester-load-balancer-54cd9754dc-cwtxg 1/1 Terminating 0 20m
harvester-system harvester-load-balancer-webhook-c8699b786-x6clw 1/1 Terminating 0 20m
harvester-system harvester-network-controller-manager-b69bf6b69-9f99x 1/1 Terminating 0 178m
harvester-system harvester-network-controller-vs4jg 1/1 Running 0 178m
harvester-system harvester-network-webhook-7b98f8cd98-gjl8b 1/1 Terminating 0 20m
harvester-system harvester-node-disk-manager-tbh4b 1/1 Running 0 26m
harvester-system harvester-node-manager-7pqcp 1/1 Running 0 178m
harvester-system harvester-node-manager-webhook-9cfccc84c-68tgp 1/1 Running 0 20m
harvester-system harvester-node-manager-webhook-9cfccc84c-6bbvg 1/1 Running 0 20m
harvester-system harvester-webhook-565dc698b6-np89r 1/1 Terminating 0 26m
harvester-system hvst-upgrade-szlg8-apply-manifests-4rmjw 0/1 Completed 0 33m
harvester-system virt-api-6fb7d97b68-cbc5m 1/1 Terminating 0 20m
harvester-system virt-api-6fb7d97b68-gqg5c 1/1 Terminating 0 23m
harvester-system virt-controller-67d8b4c75c-5qz9x 1/1 Terminating 0 24m
harvester-system virt-controller-67d8b4c75c-bdf8w 1/1 Terminating 2 (18m ago) 23m
harvester-system virt-handler-xw98h 1/1 Running 0 24m
harvester-system virt-operator-6c98db546-brgnx 1/1 Terminating 2 (18m ago) 26m
kube-system harvester-snapshot-validation-webhook-b75f94bcb-95zlb 1/1 Terminating 0 20m
kube-system harvester-snapshot-validation-webhook-b75f94bcb-xfrmf 1/1 Terminating 0 20m
kube-system harvester-whereabouts-tdr5g 1/1 Running 1 (178m ago) 178m
kube-system helm-install-rke2-ingress-nginx-4wt4j 0/1 Terminating 0 15m
kube-system helm-install-rke2-metrics-server-jn58m 0/1 Terminating 0 15m
kube-system kube-proxy-harvester-jkqhq 1/1 Running 0 178m
kube-system rke2-canal-wfpch 2/2 Running 0 178m
kube-system rke2-coredns-rke2-coredns-864fbd7785-t7k6t 1/1 Terminating 0 178m
kube-system rke2-coredns-rke2-coredns-autoscaler-6c87968579-rg6g4 1/1 Terminating 0 20m
kube-system rke2-ingress-nginx-controller-d4h25 1/1 Running 0 178m
kube-system rke2-metrics-server-7f745dbddf-2mp5j 1/1 Terminating 0 20m
kube-system rke2-multus-fsp94 1/1 Running 0 178m
kube-system snapshot-controller-65d5f465d9-5b2sb 1/1 Terminating 0 20m
kube-system snapshot-controller-65d5f465d9-c264r 1/1 Terminating 0 20m
longhorn-system backing-image-manager-c16a-7c90 1/1 Terminating 0 54m
longhorn-system csi-attacher-5fbd66cf8-674vc 1/1 Terminating 0 20m
longhorn-system csi-attacher-5fbd66cf8-725mn 1/1 Terminating 0 20m
longhorn-system csi-attacher-5fbd66cf8-85k5d 1/1 Terminating 0 20m
longhorn-system csi-provisioner-5b6ff8f4d4-97wsf 1/1 Terminating 0 20m
longhorn-system csi-provisioner-5b6ff8f4d4-cbpm9 1/1 Terminating 0 20m
longhorn-system csi-provisioner-5b6ff8f4d4-q7z58 1/1 Terminating 0 19m
longhorn-system csi-resizer-74c5555748-6rmbf 1/1 Terminating 0 20m
longhorn-system csi-resizer-74c5555748-fw2cw 1/1 Terminating 0 20m
longhorn-system csi-resizer-74c5555748-p4nph 1/1 Terminating 0 20m
longhorn-system csi-snapshotter-6bc4bcf4c5-6858b 1/1 Terminating 0 20m
longhorn-system csi-snapshotter-6bc4bcf4c5-cqkbw 1/1 Terminating 0 20m
longhorn-system csi-snapshotter-6bc4bcf4c5-mkqtg 1/1 Terminating 0 20m
longhorn-system engine-image-ei-b0369a5d-2t4k4 1/1 Running 0 178m
longhorn-system instance-manager-a5bd20597b82bcf3ba9d314620b7e670 1/1 Terminating 0 178m
longhorn-system longhorn-csi-plugin-x6bdg 3/3 Running 0 178m
longhorn-system longhorn-driver-deployer-85cf4b4849-5lc52 1/1 Terminating 0 20m
longhorn-system longhorn-loop-device-cleaner-hhvgv 1/1 Running 0 178m
longhorn-system longhorn-manager-5h2zw 1/1 Running 0 178m
longhorn-system longhorn-ui-6b677889f8-hrg8j 1/1 Terminating 0 20m
longhorn-system longhorn-ui-6b677889f8-w5hng 1/1 Terminating 0 20m
```

To resolve the issue, you must restart the `rke2-agent` service on the worker node.

```
# On the worker node
sudo systemctl restart rke2-agent.service
```

The upgrade should resume after the `rke2-agent` service is fully restarted.

:::note

This issue occurs because the agent load balancer on the worker node is unable to connect to the API server on the management node after the `rke2-server` service is restarted. Because the `rke2-server` service can be restarted multiple times when nodes are upgraded, the upgrade process is likely to become stuck again. You may need to restart the `rke2-agent` service multiple times.

To determine if the agent load balancer is functioning, run the following commands:

```
# On the management node, check if the `rke2-server` service is running.
sudo systemctl status rke2-server.service
# On the worker node, check if the agent load balancer is functioning.
sudo /var/lib/rancher/rke2/bin/kubectl --kubeconfig=/var/lib/rancher/rke2/agent/kubelet.kubeconfig get nodes
```

If the kubectl command does not return a response, the kubelet is unable to access the API server via the agent load balancer. You must restart the `rke2-agent` service.

:::

For more information, see [Issue #6432](https://github.com/harvester/harvester/issues/6432#issuecomment-2325488465).

---
2 changes: 1 addition & 1 deletion versioned_docs/version-v1.3/upgrade/v1-1-2-to-v1-2-0.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
---
sidebar_position: 5
sidebar_position: 6
sidebar_label: Upgrade from v1.1.2 to v1.2.0 (not recommended)
title: "Upgrade from v1.1.2 to v1.2.0 (not recommended)"
---
Expand Down
2 changes: 1 addition & 1 deletion versioned_docs/version-v1.3/upgrade/v1-2-0-to-v1-2-1.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
---
sidebar_position: 4
sidebar_position: 5
sidebar_label: Upgrade from v1.1.2/v1.1.3/v1.2.0 to v1.2.1
title: "Upgrade from v1.1.2/v1.1.3/v1.2.0 to v1.2.1"
---
Expand Down
2 changes: 1 addition & 1 deletion versioned_docs/version-v1.3/upgrade/v1-2-1-to-v1-2-2.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
---
sidebar_position: 3
sidebar_position: 4
sidebar_label: Upgrade from v1.2.1 to v1.2.2
title: "Upgrade from v1.2.1 to v1.2.2"
---
Expand Down
2 changes: 1 addition & 1 deletion versioned_docs/version-v1.3/upgrade/v1-2-2-to-v1-3-1.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
---
sidebar_position: 2
sidebar_position: 3
sidebar_label: Upgrade from v1.2.2/v1.3.0 to v1.3.1
title: "Upgrade from v1.2.2/v1.3.0 to v1.3.1"
---
Expand Down
Loading

0 comments on commit 1de2009

Please sign in to comment.