diff --git a/docs/advanced/storageclass.md b/docs/advanced/storageclass.md index 47951f019c3..6c80497adc2 100644 --- a/docs/advanced/storageclass.md +++ b/docs/advanced/storageclass.md @@ -44,6 +44,12 @@ The number of replicas created for each volume in Longhorn. Defaults to `3`. ![](/img/v1.2/storageclass/create_storageclasses_replicas.png) +:::info important + +When the value is `1`, the volume has only a single replica, which may block [node maintenance](../host/host.md#node-maintenance). For details, see [Single-Replica Volumes](../troubleshooting/host.md#single-replica-volumes). + +::: + #### Stale Replica Timeout Determines when Longhorn would clean up an error replica after the replica's status is ERROR. The unit is minute. Defaults to `30` minutes in Harvester. diff --git a/docs/host/host.md b/docs/host/host.md index 74fd499ab51..3532ebcd3e0 100644 --- a/docs/host/host.md +++ b/docs/host/host.md @@ -25,6 +25,20 @@ For admin users, you can click **Enable Maintenance Mode** to evict all VMs from ![node-maintenance.png](/img/v1.2/host/node-maintenance.png) +After a while, the target node enters maintenance mode. + +![node-enter-maintenance-mode.png](/img/v1.3/troubleshooting/node-enter-maintenance-mode.png) + +:::info important + +Check the [known limitations and workarounds](../troubleshooting/host.md#node-in-maintenance-mode-becomes-stuck-in-cordoned-state) before you click this menu option or if you encounter issues. + +A volume that you manually attached to this node may block node maintenance. For details, see [Manually Attached Volumes](../troubleshooting/host.md#manually-attached-volumes). + +A single-replica volume may also block node maintenance. For details, see [Single-Replica Volumes](../troubleshooting/host.md#single-replica-volumes). + +::: + ## Cordoning a Node Cordoning a node marks it as unschedulable.
This feature is useful for performing short tasks on the node during small maintenance windows, like reboots, upgrades, or decommissions. When you’re done, power back on and make the node schedulable again by uncordoning it. @@ -42,6 +56,8 @@ Before removing a node from a Harvester cluster, determine if the remaining node If the remaining nodes do not have enough resources, VMs might fail to migrate and volumes might degrade when you remove a node. +If any volumes were created from a custom `StorageClass` with [Number of Replicas](../advanced/storageclass.md#number-of-replicas) set to **1**, back up those single-replica volumes or redeploy the related workloads to other nodes in advance so that the volumes are scheduled to other nodes. Otherwise, those volumes cannot be rebuilt or restored from other nodes after this node is removed. + ::: ### 1. Check if the node can be removed from the cluster. @@ -522,4 +538,4 @@ status: ``` The `harvester-node-manager` pod(s) in the `harvester-system` namespace may also contain some hints as to why it is not rendering a file to a node. -This pod is part of a daemonset, so it may be worth checking the pod that is running on the node of interest. \ No newline at end of file +This pod is part of a daemonset, so it may be worth checking the pod that is running on the node of interest. diff --git a/docs/troubleshooting/host.md b/docs/troubleshooting/host.md new file mode 100644 index 00000000000..62481a7586c --- /dev/null +++ b/docs/troubleshooting/host.md @@ -0,0 +1,139 @@ +--- +sidebar_position: 6 +sidebar_label: Host +title: "Host" +--- + + + + + +## Node in Maintenance Mode Becomes Stuck in Cordoned State + +When you enable Maintenance Mode on a node using the Harvester UI, the node becomes stuck in the `Cordoned` state and the menu shows the **Enable Maintenance Mode** option instead of **Disable Maintenance Mode**.
+ +![node-stuck-cordoned.png](/img/v1.3/troubleshooting/node-stuck-cordoned.png) + +The Harvester pod logs contain messages similar to the following: + +``` +time="2024-08-05T19:03:02Z" level=info msg="evicting pod longhorn-system/instance-manager-68cd2514dd3f6d59b95cbd865d5b08f7" +time="2024-08-05T19:03:02Z" level=info msg="error when evicting pods/\"instance-manager-68cd2514dd3f6d59b95cbd865d5b08f7\" -n \"longhorn-system\" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget." + +time="2024-08-05T19:03:07Z" level=info msg="evicting pod longhorn-system/instance-manager-68cd2514dd3f6d59b95cbd865d5b08f7" +time="2024-08-05T19:03:07Z" level=info msg="error when evicting pods/\"instance-manager-68cd2514dd3f6d59b95cbd865d5b08f7\" -n \"longhorn-system\" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget." + +time="2024-08-05T19:03:12Z" level=info msg="evicting pod longhorn-system/instance-manager-68cd2514dd3f6d59b95cbd865d5b08f7" +time="2024-08-05T19:03:12Z" level=info msg="error when evicting pods/\"instance-manager-68cd2514dd3f6d59b95cbd865d5b08f7\" -n \"longhorn-system\" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget." +``` + +The Longhorn Instance Manager uses a PodDisruptionBudget (PDB) to protect itself from accidental eviction, which results in loss of volume data. When the Maintenance Mode error occurs, it indicates that the `instance-manager` pod is still serving volumes or replicas. + +The following sections describe the known causes and their corresponding workarounds. + +### Manually Attached Volumes + +A volume that is attached to a node using the [embedded Longhorn UI](./harvester.md#access-embedded-rancher-and-longhorn-dashboards) can cause the error. This is because the object is attached to a node name instead of the pod name. + +You can check it from the [Embedded Longhorn UI](./harvester.md#access-embedded-rancher-and-longhorn-dashboards). 
+ +![attached-volume.png](/img/v1.3/troubleshooting/attached-volume.png) + +The manually attached object is attached to a node name instead of the pod name. + +You can also use the CLI to retrieve the details of the CRD object `VolumeAttachment`. + +Example of a volume that was attached using the Longhorn UI: + +``` +- apiVersion: longhorn.io/v1beta2 + kind: VolumeAttachment +... + spec: + attachmentTickets: + longhorn-ui: + id: longhorn-ui + nodeID: node-name +... + volume: pvc-9b35136c-f59e-414b-aa55-b84b9b21ff89 +``` + +Example of a volume that was attached using the Longhorn CSI driver: + +``` +- apiVersion: longhorn.io/v1beta2 + kind: VolumeAttachment + spec: + attachmentTickets: + csi-b5097155cddde50b4683b0e659923e379cbfc3873b5b2ee776deb3874102e9bf: + id: csi-b5097155cddde50b4683b0e659923e379cbfc3873b5b2ee776deb3874102e9bf + nodeID: node-name +... + volume: pvc-3c6403cd-f1cd-4b84-9b46-162f746b9667 +``` + +:::note + +Manually attaching a volume to the node is not recommended. + +::: + +#### Workaround 1: Set `Detach Manually Attached Volumes When Cordoned` to `True` + +The Longhorn setting [Detach Manually Attached Volumes When Cordoned](https://longhorn.io/docs/1.6.0/references/settings/#detach-manually-attached-volumes-when-cordoned) determines whether Longhorn automatically detaches volumes that are manually attached to a node when the node is cordoned. When the setting is disabled, those volumes block node draining. + +The default value of this setting depends on the embedded Longhorn version: + +| Harvester version | Embedded Longhorn version | Default value | +| --- | --- | --- | +| v1.3.1 | v1.6.0 | `true` | +| v1.4.0 | v1.7.0 | `false` | + +Set this option to `true` from the [embedded Longhorn UI](./harvester.md#access-embedded-rancher-and-longhorn-dashboards). + +#### Workaround 2: Manually Detach the Volume + +Detach the volume using the [embedded Longhorn UI](./harvester.md#access-embedded-rancher-and-longhorn-dashboards).
+ +![detached-volume.png](/img/v1.3/troubleshooting/detached-volume.png) + +Once the volume is detached, you can successfully enable Maintenance Mode on the node. + +![node-enter-maintenance-mode.png](/img/v1.3/troubleshooting/node-enter-maintenance-mode.png) + +### Single-Replica Volumes + +Harvester allows you to create custom StorageClasses that describe how Longhorn must provision volumes. If necessary, you can create a StorageClass with the [Number of Replicas](../advanced/storageclass.md#number-of-replicas) parameter set to `1`. + +When a volume is created using such a StorageClass and is attached to a node using the CSI driver or other methods, the lone replica stays on that node even after the volume is detached. + +You can check this using the CRD object `Volume`. + +``` +- apiVersion: longhorn.io/v1beta2 + kind: Volume +... + spec: +... + numberOfReplicas: 1 # the replica number +... + status: +... + ownerID: nodeName +... + state: attached +``` + +#### Workaround: Set `Node Drain Policy` + +The Longhorn [Node Drain Policy](https://longhorn.io/docs/1.6.0/references/settings/#node-drain-policy) is set to `block-if-contains-last-replica` by default. This option forces Longhorn to block node draining when the node contains the last healthy replica of a volume. + +To address the issue, change the value to `allow-if-replica-is-stopped` using the [embedded Longhorn UI](./harvester.md#access-embedded-rancher-and-longhorn-dashboards). + +:::info important + +If you plan to remove the node after Maintenance Mode is enabled, back up single-replica volumes or redeploy the related workloads to other nodes in advance so that the volumes are scheduled to other nodes. + +::: + +Starting with Harvester v1.4.0, the `Node Drain Policy` is set to `allow-if-replica-is-stopped` by default.
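The `Volume` check in the Single-Replica Volumes section could also be scripted. The following is a minimal sketch, assuming the input is a Kubernetes-style list of Longhorn `Volume` objects (for example, JSON obtained with a command such as `kubectl get volumes.longhorn.io -n longhorn-system -o json`, which is an assumption and not part of the documented procedure). The function name `single_replica_volumes_by_owner` and the sample data are hypothetical; only the `spec.numberOfReplicas` and `status.ownerID` fields come from the example.

```python
from collections import defaultdict


def single_replica_volumes_by_owner(volume_list):
    """Group single-replica Longhorn volumes by their status.ownerID.

    volume_list is a dict shaped like a Kubernetes list response:
    {"items": [{"metadata": ..., "spec": ..., "status": ...}, ...]}.
    """
    grouped = defaultdict(list)
    for item in volume_list.get("items", []):
        # Skip volumes that are not configured with exactly one replica.
        if item.get("spec", {}).get("numberOfReplicas") != 1:
            continue
        owner = item.get("status", {}).get("ownerID", "<unknown>")
        grouped[owner].append(item["metadata"]["name"])
    return dict(grouped)


# Hypothetical sample data that mirrors the fields in the Volume example.
sample = {
    "items": [
        {"metadata": {"name": "pvc-a"}, "spec": {"numberOfReplicas": 1},
         "status": {"ownerID": "node-1", "state": "attached"}},
        {"metadata": {"name": "pvc-b"}, "spec": {"numberOfReplicas": 3},
         "status": {"ownerID": "node-1", "state": "attached"}},
        {"metadata": {"name": "pvc-c"}, "spec": {"numberOfReplicas": 1},
         "status": {"ownerID": "node-2", "state": "detached"}},
    ]
}

for node, volumes in sorted(single_replica_volumes_by_owner(sample).items()):
    print(f"{node}: {', '.join(volumes)}")
```

The sketch groups by `status.ownerID` only because the example uses that field. Before draining a node, confirm where the replica actually resides in the embedded Longhorn UI.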
diff --git a/docs/volume/create-volume.md b/docs/volume/create-volume.md index d26c9a01ced..3e525baee9e 100644 --- a/docs/volume/create-volume.md +++ b/docs/volume/create-volume.md @@ -29,6 +29,12 @@ description: Create a volume from the Volume page. ![create-empty-volume](/img/v1.2/volume/create-empty-volume.png) +:::info important + +Harvester dynamically attaches and detaches volumes based on user operations. If you plan to manually attach a volume to a specific node, the attachment may block [node maintenance](../host/host.md#node-maintenance). For details, see [Manually Attached Volumes](../troubleshooting/host.md#manually-attached-volumes). + +::: + diff --git a/static/img/v1.3/troubleshooting/attached-volume.png b/static/img/v1.3/troubleshooting/attached-volume.png new file mode 100644 index 00000000000..41e53397a32 Binary files /dev/null and b/static/img/v1.3/troubleshooting/attached-volume.png differ diff --git a/static/img/v1.3/troubleshooting/detached-volume.png b/static/img/v1.3/troubleshooting/detached-volume.png new file mode 100644 index 00000000000..a4ebef7f00e Binary files /dev/null and b/static/img/v1.3/troubleshooting/detached-volume.png differ diff --git a/static/img/v1.3/troubleshooting/node-enter-maintenance-mode.png b/static/img/v1.3/troubleshooting/node-enter-maintenance-mode.png new file mode 100644 index 00000000000..9779e39e799 Binary files /dev/null and b/static/img/v1.3/troubleshooting/node-enter-maintenance-mode.png differ diff --git a/static/img/v1.3/troubleshooting/node-stuck-cordoned.png b/static/img/v1.3/troubleshooting/node-stuck-cordoned.png new file mode 100644 index 00000000000..7b8e89db9cb Binary files /dev/null and b/static/img/v1.3/troubleshooting/node-stuck-cordoned.png differ diff --git a/versioned_docs/version-v1.3/advanced/storageclass.md b/versioned_docs/version-v1.3/advanced/storageclass.md index 682c8dce369..19532025286 100644 --- a/versioned_docs/version-v1.3/advanced/storageclass.md +++
b/versioned_docs/version-v1.3/advanced/storageclass.md @@ -41,6 +41,12 @@ The number of replicas created for each volume in Longhorn. Defaults to `3`. ![](/img/v1.2/storageclass/create_storageclasses_replicas.png) +:::info important + +When the value is `1`, the volume has only a single replica, which may block [node maintenance](../host/host.md#node-maintenance). For details, see [Single-Replica Volumes](../troubleshooting/host.md#single-replica-volumes). + +::: + #### Stale Replica Timeout Determines when Longhorn would clean up an error replica after the replica's status is ERROR. The unit is minute. Defaults to `30` minutes in Harvester. @@ -148,4 +154,4 @@ Then, create a new `StorageClass` for the HDD (use the above disk tags). For har You can now create a volume using the above `StorageClass` with HDDs mostly for cold storage or archiving purpose. -![](/img/v1.2/storageclass/create_volume_hdd.png) \ No newline at end of file +![](/img/v1.2/storageclass/create_volume_hdd.png) diff --git a/versioned_docs/version-v1.3/host/host.md b/versioned_docs/version-v1.3/host/host.md index 74fd499ab51..3532ebcd3e0 100644 --- a/versioned_docs/version-v1.3/host/host.md +++ b/versioned_docs/version-v1.3/host/host.md @@ -25,6 +25,20 @@ For admin users, you can click **Enable Maintenance Mode** to evict all VMs from ![node-maintenance.png](/img/v1.2/host/node-maintenance.png) +After a while, the target node enters maintenance mode. + +![node-enter-maintenance-mode.png](/img/v1.3/troubleshooting/node-enter-maintenance-mode.png) + +:::info important + +Check the [known limitations and workarounds](../troubleshooting/host.md#node-in-maintenance-mode-becomes-stuck-in-cordoned-state) before you click this menu option or if you encounter issues.
+ +A volume that you manually attached to this node may block node maintenance. For details, see [Manually Attached Volumes](../troubleshooting/host.md#manually-attached-volumes). + +A single-replica volume may also block node maintenance. For details, see [Single-Replica Volumes](../troubleshooting/host.md#single-replica-volumes). + +::: + ## Cordoning a Node Cordoning a node marks it as unschedulable. This feature is useful for performing short tasks on the node during small maintenance windows, like reboots, upgrades, or decommissions. When you’re done, power back on and make the node schedulable again by uncordoning it. @@ -42,6 +56,8 @@ Before removing a node from a Harvester cluster, determine if the remaining node If the remaining nodes do not have enough resources, VMs might fail to migrate and volumes might degrade when you remove a node. +If any volumes were created from a custom `StorageClass` with [Number of Replicas](../advanced/storageclass.md#number-of-replicas) set to **1**, back up those single-replica volumes or redeploy the related workloads to other nodes in advance so that the volumes are scheduled to other nodes. Otherwise, those volumes cannot be rebuilt or restored from other nodes after this node is removed. + ::: ### 1. Check if the node can be removed from the cluster. @@ -522,4 +538,4 @@ status: ``` The `harvester-node-manager` pod(s) in the `harvester-system` namespace may also contain some hints as to why it is not rendering a file to a node. -This pod is part of a daemonset, so it may be worth checking the pod that is running on the node of interest. \ No newline at end of file +This pod is part of a daemonset, so it may be worth checking the pod that is running on the node of interest.
diff --git a/versioned_docs/version-v1.3/troubleshooting/host.md b/versioned_docs/version-v1.3/troubleshooting/host.md new file mode 100644 index 00000000000..62481a7586c --- /dev/null +++ b/versioned_docs/version-v1.3/troubleshooting/host.md @@ -0,0 +1,139 @@ +--- +sidebar_position: 6 +sidebar_label: Host +title: "Host" +--- + + + + + +## Node in Maintenance Mode Becomes Stuck in Cordoned State + +When you enable Maintenance Mode on a node using the Harvester UI, the node becomes stuck in the `Cordoned` state and the menu shows the **Enable Maintenance Mode** option instead of **Disable Maintenance Mode**. + +![node-stuck-cordoned.png](/img/v1.3/troubleshooting/node-stuck-cordoned.png) + +The Harvester pod logs contain messages similar to the following: + +``` +time="2024-08-05T19:03:02Z" level=info msg="evicting pod longhorn-system/instance-manager-68cd2514dd3f6d59b95cbd865d5b08f7" +time="2024-08-05T19:03:02Z" level=info msg="error when evicting pods/\"instance-manager-68cd2514dd3f6d59b95cbd865d5b08f7\" -n \"longhorn-system\" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget." + +time="2024-08-05T19:03:07Z" level=info msg="evicting pod longhorn-system/instance-manager-68cd2514dd3f6d59b95cbd865d5b08f7" +time="2024-08-05T19:03:07Z" level=info msg="error when evicting pods/\"instance-manager-68cd2514dd3f6d59b95cbd865d5b08f7\" -n \"longhorn-system\" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget." + +time="2024-08-05T19:03:12Z" level=info msg="evicting pod longhorn-system/instance-manager-68cd2514dd3f6d59b95cbd865d5b08f7" +time="2024-08-05T19:03:12Z" level=info msg="error when evicting pods/\"instance-manager-68cd2514dd3f6d59b95cbd865d5b08f7\" -n \"longhorn-system\" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget." 
+``` + +The Longhorn Instance Manager uses a PodDisruptionBudget (PDB) to protect itself from accidental eviction, which results in loss of volume data. When the Maintenance Mode error occurs, it indicates that the `instance-manager` pod is still serving volumes or replicas. + +The following sections describe the known causes and their corresponding workarounds. + +### Manually Attached Volumes + +A volume that is attached to a node using the [embedded Longhorn UI](./harvester.md#access-embedded-rancher-and-longhorn-dashboards) can cause the error. This is because the object is attached to a node name instead of the pod name. + +You can check it from the [Embedded Longhorn UI](./harvester.md#access-embedded-rancher-and-longhorn-dashboards). + +![attached-volume.png](/img/v1.3/troubleshooting/attached-volume.png) + +The manually attached object is attached to a node name instead of the pod name. + +You can also use the CLI to retrieve the details of the CRD object `VolumeAttachment`. + +Example of a volume that was attached using the Longhorn UI: + +``` +- apiVersion: longhorn.io/v1beta2 + kind: VolumeAttachment +... + spec: + attachmentTickets: + longhorn-ui: + id: longhorn-ui + nodeID: node-name +... + volume: pvc-9b35136c-f59e-414b-aa55-b84b9b21ff89 +``` + +Example of a volume that was attached using the Longhorn CSI driver: + +``` +- apiVersion: longhorn.io/v1beta2 + kind: VolumeAttachment + spec: + attachmentTickets: + csi-b5097155cddde50b4683b0e659923e379cbfc3873b5b2ee776deb3874102e9bf: + id: csi-b5097155cddde50b4683b0e659923e379cbfc3873b5b2ee776deb3874102e9bf + nodeID: node-name +... + volume: pvc-3c6403cd-f1cd-4b84-9b46-162f746b9667 +``` + +:::note + +Manually attaching a volume to the node is not recommended. 
+ +::: + +#### Workaround 1: Set `Detach Manually Attached Volumes When Cordoned` to `True` + +The Longhorn setting [Detach Manually Attached Volumes When Cordoned](https://longhorn.io/docs/1.6.0/references/settings/#detach-manually-attached-volumes-when-cordoned) determines whether Longhorn automatically detaches volumes that are manually attached to a node when the node is cordoned. When the setting is disabled, those volumes block node draining. + +The default value of this setting depends on the embedded Longhorn version: + +| Harvester version | Embedded Longhorn version | Default value | +| --- | --- | --- | +| v1.3.1 | v1.6.0 | `true` | +| v1.4.0 | v1.7.0 | `false` | + +Set this option to `true` from the [embedded Longhorn UI](./harvester.md#access-embedded-rancher-and-longhorn-dashboards). + +#### Workaround 2: Manually Detach the Volume + +Detach the volume using the [embedded Longhorn UI](./harvester.md#access-embedded-rancher-and-longhorn-dashboards). + +![detached-volume.png](/img/v1.3/troubleshooting/detached-volume.png) + +Once the volume is detached, you can successfully enable Maintenance Mode on the node. + +![node-enter-maintenance-mode.png](/img/v1.3/troubleshooting/node-enter-maintenance-mode.png) + +### Single-Replica Volumes + +Harvester allows you to create custom StorageClasses that describe how Longhorn must provision volumes. If necessary, you can create a StorageClass with the [Number of Replicas](../advanced/storageclass.md#number-of-replicas) parameter set to `1`. + +When a volume is created using such a StorageClass and is attached to a node using the CSI driver or other methods, the lone replica stays on that node even after the volume is detached. + +You can check this using the CRD object `Volume`. + +``` +- apiVersion: longhorn.io/v1beta2 + kind: Volume +... + spec: +... + numberOfReplicas: 1 # the replica number +... + status: +... + ownerID: nodeName +...
+ state: attached +``` + +#### Workaround: Set `Node Drain Policy` + +The Longhorn [Node Drain Policy](https://longhorn.io/docs/1.6.0/references/settings/#node-drain-policy) is set to `block-if-contains-last-replica` by default. This option forces Longhorn to block node draining when the node contains the last healthy replica of a volume. + +To address the issue, change the value to `allow-if-replica-is-stopped` using the [embedded Longhorn UI](./harvester.md#access-embedded-rancher-and-longhorn-dashboards). + +:::info important + +If you plan to remove the node after Maintenance Mode is enabled, back up single-replica volumes or redeploy the related workloads to other nodes in advance so that the volumes are scheduled to other nodes. + +::: + +Starting with Harvester v1.4.0, the `Node Drain Policy` is set to `allow-if-replica-is-stopped` by default. diff --git a/versioned_docs/version-v1.3/volume/create-volume.md b/versioned_docs/version-v1.3/volume/create-volume.md index 66add1ddea1..26ef8e7f95b 100644 --- a/versioned_docs/version-v1.3/volume/create-volume.md +++ b/versioned_docs/version-v1.3/volume/create-volume.md @@ -26,6 +26,12 @@ description: Create a volume from the Volume page. ![create-empty-volume](/img/v1.2/volume/create-empty-volume.png) +:::info important + +Harvester dynamically attaches and detaches volumes based on user operations. If you plan to manually attach a volume to a specific node, the attachment may block [node maintenance](../host/host.md#node-maintenance). For details, see [Manually Attached Volumes](../troubleshooting/host.md#manually-attached-volumes). + +::: + ## Create an Image Volume ### Header Section @@ -38,4 +44,4 @@ description: Create a volume from the Volume page. 1. Select an existing `Image`. 1. Configure the `Size` of the volume. -![create-image-volume](/img/v1.2/volume/create-image-volume.png) \ No newline at end of file +![create-image-volume](/img/v1.2/volume/create-image-volume.png)