diff --git a/enhancements/20230905-automatically-evict-replicas-while-draining.md b/enhancements/20230905-automatically-evict-replicas-while-draining.md
index 2d9e94a68a..a95f255f84 100644
--- a/enhancements/20230905-automatically-evict-replicas-while-draining.md
+++ b/enhancements/20230905-automatically-evict-replicas-while-draining.md
@@ -43,7 +43,7 @@ drained:
 
 - There is no opportunity for Longhorn to protect data.
 
-This proposal seeks to add a fourth behavior (node drain policy) with the following properties:
+This proposal seeks to add a fourth and fifth behavior (node drain policy) with the following properties:
 
 - `Block For Eviction` ensures the `instance-manager` pod cannot be drained from a node as long as it contains any
   replicas for any volumes. Replicas are automatically evicted from the node as soon as it is cordoned.
@@ -58,7 +58,28 @@ This proposal seeks to add a fourth and fifth behavior (node drain policy) with the follow
   - The drain operation is significantly slower than for other behaviors. Every replica must be rebuilt on another
     node before it can complete.
-  - Like all of these policities, it triggers on cordon, not on drain (it is not possible for Longhorn to distinguish
+  - The drain operation is data-intensive, especially when replica auto balance is enabled, as evicted replicas may be
+    moved back to the drained node when/if it comes back online.
+  - Like all of these policies, it triggers on cordon, not on drain (it is not possible for Longhorn to distinguish
+    between a node that is actively being drained and one that is cordoned for some other reason). If a user
+    regularly cordons nodes without draining them, replicas will be rebuilt pointlessly.
+
+- `Block For Eviction If Contains Last Replica` ensures the `instance-manager` pod cannot be drained from a node as long
+  as it is the last node with a healthy replica for some volume. Replicas that meet this condition are automatically
+  evicted from the node as soon as it is cordoned.
+
+  Benefits:
+
+  - Protects data by preventing the drain operation from completing until there is a healthy replica available on
+    another node for each volume.
+  - Automatically evicts replicas, so the user does not need to do it manually (through the UI).
+  - The drain operation is only as slow and data-intensive as is necessary to protect data.
+
+  Drawbacks:
+
+  - Volumes may be degraded after the drain is complete. If the node is rebooted, redundancy is reduced until it is
+    running again. If the node is removed, redundancy is reduced until another replica rebuilds.
+  - Like all of these policies, it triggers on cordon, not on drain (it is not possible for Longhorn to distinguish
+    between a node that is actively being drained and one that is cordoned for some other reason). If a user
+    regularly cordons nodes without draining them, replicas will be rebuilt pointlessly.
@@ -66,6 +87,11 @@ Given the drawbacks, `Block For Eviction` should likely not be the default node
 some users may find it helpful to switch to `Block For Eviction`, especially during cluster upgrade operations. See
 [user stories](#user-stories) for additional insight.
 
+`Block For Eviction If Contains Last Replica` is a more efficient behavior that may make sense as the long-term setting
+in some clusters. It has largely the same benefits and drawbacks as `Block If Contains Last Replica`, except that it
+doesn't require users to perform any manual evictions. Still, when data is properly backed up and the user is planning to
+bring a node back online after maintenance, it is costlier than other options.
+
 ### Related Issues
 
 https://github.com/longhorn/longhorn/issues/2238
@@ -112,8 +138,8 @@ nodes. It may take a long time, but I know my data is safe when the drain completes.
 
 ### API changes
 
-Add a `block-for-eviction` option to the `node-drain-policy` setting. The user chooses this option to opt in to the new
-behavior.
+Add `block-for-eviction` and `block-for-eviction-if-contains-last-replica` options to the `node-drain-policy` setting.
+The user chooses these options to opt in to the new behavior.
 
 Add a `status.autoEvicting` to the `node.longhorn.io/v1beta2` custom resource. This is not a field users can/should
 interact with, but they can view it via kubectl.
@@ -125,11 +151,14 @@ with the type `Evicting`. However, this was a bit less natural, because:
 
 - During normal operation, `Evicting` should be `False`. The Longhorn UI displays a condition in this state with a red
   symbol, indicating an error state that should be investigated.
 
+Deprecate the `replica.status.evictionRequested` field in favor of a new `replica.spec.evictionRequested` field so that
+replica eviction can be requested by the node controller instead of the replica controller.
+
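+The resulting API surface might look roughly like the following Go sketch. The constant and field names here are
+hypothetical and shown only for illustration; the implementation may choose different identifiers.
+
+```go
+package types
+
+// New values for the node-drain-policy setting introduced by this proposal. They join the existing
+// values (for example "block-if-contains-last-replica", the current default).
+const (
+	NodeDrainPolicyBlockForEviction                      = "block-for-eviction"
+	NodeDrainPolicyBlockForEvictionIfContainsLastReplica = "block-for-eviction-if-contains-last-replica"
+)
+
+// NodeStatus gains a read-only field that reports whether replicas are being automatically evicted from the node.
+type NodeStatus struct {
+	// ...existing fields...
+	AutoEvicting bool `json:"autoEvicting"`
+}
+
+// ReplicaSpec gains an evictionRequested field; replica.status.evictionRequested is deprecated in its favor.
+type ReplicaSpec struct {
+	// ...existing fields...
+	EvictionRequested bool `json:"evictionRequested"`
+}
+```
+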
 ## Design
 
 ### Implementation Overview
 
-The existing eviction logic is well-tested, so there is no reason to refactor it. It works as follows:
+The existing eviction logic is well-tested, so there is no reason to significantly refactor it. It works as follows:
 
 - The user can set `spec.evictionRequested = true` on a node or disk.
 - When the replica controller sees `spec.evictionRequested == true` on the node or disk hosting a replica, it sets
@@ -145,27 +174,37 @@ The existing eviction logic is well-tested, so there is no reason to refactor it
 - NOTE: If a new replica already started rebuilding as part of an eviction, it continues to rebuild and remains in the
   cluster even after eviction is canceled. It can be cleaned up manually if desired.
 
-Make slight changes so that:
-
-- The node controller checks `status.Unschedulable` on the appropriate Kubernetes node object. If
-  `status.Unschedulable == true` and the node drain policy is `block-for-eviction`, it sets `status.autoEvicting = true`
-  on the appropriate `node.longhorn.io` object.
-- In addition to its pre-existing checks, if the replica controller sees `status.autoEvicting == true` on the node
-  hosting a replica, it sets `status.evictionRequested = true` on that replica.
-- The volume controller still uses `replica.status.evictionRequested == true` to influence replica
-  scheduling/deletionbehavior (e.g. rebuild an extra replica to replace the evicting one or delete the evicting one once
-  rebuilding is complete).
-- The node controller checks `status.Unschedulable` on the appropriate Kubernetes node object. If
-  `status.Unschedulable == false` or the node drain policy is no longer `block-for-eviction`, it sets
-  `status.autoEvicting = false` on the appropriate `node.longhorn.io` object.
-- In addition to its pre-existing checks, if the replica controller sees `status.autoEvicting == false` on the node
-  hosting a replica, it may set `status.evictionRequested = false` on that replica.
-- NOTE: If a new replica already started rebuilding as part of an automatic eviction, it continues to rebuild and
-  remains in the cluster even after eviction is canceled. It can be cleaned up manually if desired.
+Make changes so that:
+
+- The node controller (not the replica controller) sets `replica.spec.evictionRequested = true` when (see the sketch
+  after this list):
+  - `spec.evictionRequested == true` on the replica's node (similar to existing behavior moved from the replica
+    controller), OR
+  - `spec.evictionRequested == true` on the replica's disk (similar to existing behavior moved from the replica
+    controller), OR
+  - `status.Unschedulable == true` on the associated Kubernetes node object and the node drain policy is
+    `block-for-eviction`, OR
+  - `status.Unschedulable == true` on the associated Kubernetes node object, the node drain policy is
+    `block-for-eviction-if-contains-last-replica`, and there are no other PDB-protected replicas for a volume.
+- Much of the logic currently used by the instance manager controller to recognize PDB-protected replicas is moved to
+  utility functions so both the node and instance manager controllers can use it.
+- The volume controller uses `replica.spec.evictionRequested == true` in exactly the same way it previously used
+  `replica.status.evictionRequested` to influence replica scheduling/deletion behavior (e.g. rebuild an extra replica
+  to replace the evicting one or delete the evicting one once rebuilding is complete).
+- The node controller sets `replica.spec.evictionRequested = false` when:
+  - `spec.evictionRequested == false` on the replica's node, AND
+  - `spec.evictionRequested == false` on the replica's disk, AND one of the following:
+    - `status.Unschedulable == false` on the associated Kubernetes node object, OR
+    - `status.Unschedulable == true` on the associated Kubernetes node object and the node drain policy is not
+      `block-for-eviction` or `block-for-eviction-if-contains-last-replica`, OR
+    - `status.Unschedulable == true` on the associated Kubernetes node object, the node drain policy is
+      `block-for-eviction-if-contains-last-replica`, and there is another PDB-protected replica.
+- The node controller sets `status.autoEvicting = true` when a node has evicting replicas because of the new drain
+  policies and `status.autoEvicting = false` when it does not. This provides a clue to the user (and the UI) while auto
+  eviction is ongoing.
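+
+The eviction decision described in the list above boils down to a single predicate that the node controller can
+evaluate on each reconcile. The following is a simplified, hypothetical Go sketch: the real controller works with
+Longhorn CRD objects and the shared PDB-protection helpers rather than plain booleans, and the function and parameter
+names are illustrative only.
+
+```go
+package controller
+
+// Hypothetical constants matching the setting values used in this proposal.
+const (
+	policyBlockForEviction              = "block-for-eviction"
+	policyBlockForEvictionIfLastReplica = "block-for-eviction-if-contains-last-replica"
+)
+
+// shouldEvictReplica sketches the node controller's decision for a single replica. The controller
+// would set replica.spec.evictionRequested to this value on each reconcile, which also covers the
+// cancellation cases described above.
+func shouldEvictReplica(
+	nodeEvictionRequested bool, // spec.evictionRequested on the replica's Longhorn node
+	diskEvictionRequested bool, // spec.evictionRequested on the replica's disk
+	kubeNodeUnschedulable bool, // status.Unschedulable on the associated Kubernetes node object
+	drainPolicy string, // current value of the node-drain-policy setting
+	hasOtherPDBProtectedReplica bool, // another PDB-protected replica exists for the volume
+) bool {
+	// Manually requested eviction (on the node or the disk) always wins.
+	if nodeEvictionRequested || diskEvictionRequested {
+		return true
+	}
+	// Automatic eviction only applies while the Kubernetes node is cordoned.
+	if !kubeNodeUnschedulable {
+		return false
+	}
+	switch drainPolicy {
+	case policyBlockForEviction:
+		return true
+	case policyBlockForEvictionIfLastReplica:
+		return !hasOtherPDBProtectedReplica
+	default:
+		return false
+	}
+}
+```
+
+Because the predicate is re-evaluated continuously, uncordoning the node or switching to another drain policy clears
+`replica.spec.evictionRequested` again (provided eviction is not still requested on the node or disk), which matches
+the cancellation conditions listed above.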
 
 ### Test plan
 
-Test normal behavior:
+Test normal behavior with `block-for-eviction`:
 
 - Create a volume.
 - Ensure (through soft anti-affinity, low replica count, and/or enough disks) that an evicted replica of the volume can
@@ -175,12 +214,36 @@ Test normal behavior:
 - While the drain is ongoing:
   - Verify that the volume never becomes degraded.
   - Verify that `node.status.autoEvicting == true`.
+  - Verify that `replica.spec.evictionRequested == true`.
 - Verify the drain completes.
 - Uncordon the node.
+- Verify the replica on the drained node has moved to a different one.
 - Verify that `node.status.autoEvicting == false`.
+- Verify that `replica.spec.evictionRequested == false`.
 - Verify the volume's data.
 
-Test unschedulable behavior:
+Test normal behavior with `block-for-eviction-if-contains-last-replica`:
+
+- Create one volume with a single replica and another volume with three replicas.
+- Ensure (through soft anti-affinity, low replica count, and/or enough disks) that evicted replicas of both volumes can
+  be scheduled elsewhere.
+- Write data to the volumes.
+- Drain a node both volumes have a replica scheduled to.
+- While the drain is ongoing:
+  - Verify that the volume with one replica never becomes degraded.
+  - Verify that the volume with three replicas becomes degraded.
+  - Verify that `node.status.autoEvicting == true`.
+  - Verify that `replica.spec.evictionRequested == true` on the replica for the volume that only has one.
+  - Verify that `replica.spec.evictionRequested == false` on the replica for the volume that has three.
+- Verify the drain completes.
+- Uncordon the node.
+- Verify the replica for the volume with one replica has moved to a different node.
+- Verify the replica for the volume with three replicas has not moved.
+- Verify that `node.status.autoEvicting == false`.
+- Verify that `replica.spec.evictionRequested == false` on all replicas.
+- Verify the data in both volumes.
+
+Test unschedulable behavior with `block-for-eviction`:
 
 - Create a volume.
 - Ensure (through soft anti-affinity, high replica count, and/or not enough disks) that an evicted replica of the volume
@@ -198,8 +261,9 @@ Test unschedulable behavior:
 
 ### Upgrade strategy
 
-Add `status.autoEvicting == false` to all `node.longhorn.io` objects during the upgrade. The default node drain policy
-remains `Block If Contains Last Replica`, so do not make setting changes.
+- Add `status.autoEvicting = false` to all `node.longhorn.io` objects during the upgrade.
+- Add `spec.evictionRequested = status.evictionRequested` to all replica objects during the upgrade (see the sketch
+  below).
+- The default node drain policy remains `Block If Contains Last Replica`, so do not make setting changes.
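+
+As a rough illustration of the replica portion of this migration, the sketch below uses a hypothetical stand-in type
+and function name; the real upgrade code operates on the stored Longhorn custom resources.
+
+```go
+package upgrade
+
+// replica is a minimal stand-in for the Longhorn replica custom resource; only the fields relevant
+// to this migration are shown.
+type replica struct {
+	Spec struct {
+		EvictionRequested bool
+	}
+	Status struct {
+		EvictionRequested bool
+	}
+}
+
+// migrateEvictionRequested copies the deprecated status field into the new spec field so that any
+// eviction already in progress continues across the upgrade.
+func migrateEvictionRequested(replicas []*replica) {
+	for _, r := range replicas {
+		r.Spec.EvictionRequested = r.Status.EvictionRequested
+	}
+}
+```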
 
 ## Note