Automatically evict replicas while draining LEP second major draft
Longhorn 2238

Signed-off-by: Eric Weber <eric.weber@suse.com>
ejweber committed Oct 24, 2023
1 parent 93165ac commit fbd7036
116 changes: 90 additions & 26 deletions enhancements/20230905-automatically-evict-replicas-while-draining.md

- There is no opportunity for Longhorn to protect data.

This proposal seeks to add fourth and fifth behaviors (node drain policies) with the following properties:

- `Block For Eviction` ensures the `instance-manager` pod cannot be drained from a node as long as it contains any
replicas for any volumes. Replicas are automatically evicted from the node as soon as it is cordoned.

- The drain operation is significantly slower than for other behaviors. Every replica must be rebuilt on another node
before it can complete.
- The drain operation is data-intensive, especially when replica auto balance is enabled, as evicted replicas may be
moved back to the drained node when/if it comes back online.
- Like all of these policies, it triggers on cordon, not on drain (it is not possible for Longhorn to distinguish
between a node that is actively being drained and one that is cordoned for some other reason). If a user
regularly cordons nodes without draining them, replicas will be rebuilt pointlessly.

- `Block For Eviction If Contains Last Replica` ensures the `instance-manager` pod cannot be drained from a node as long
as it is the last node with a healthy replica for some volume. Replicas that meet this condition are automatically
evicted from the node as soon as it is cordoned.

Benefits:

- Protects data by preventing the drain operation from completing until each affected volume has a healthy replica
available on another node.
- Automatically evicts replicas, so the user does not need to do it manually (through the UI).
- The drain operation is only as slow and data-intensive as is necessary to protect data.

Drawbacks:

- Volumes may be degraded after the drain is complete. If the node is rebooted, redundancy is reduced until it is
running again. If the node is removed, redundancy is reduced until another replica rebuilds.
- Like all of these policies, it triggers on cordon, not on drain (it is not possible for Longhorn to distinguish
between a node that is actively being drained and one that is cordoned for some other reason). If a user
regularly cordons nodes without draining them, replicas will be rebuilt pointlessly.

Given the drawbacks, `Block For Eviction` should likely not be the default node drain policy moving forward. However,
some users may find it helpful to switch to `Block For Eviction`, especially during cluster upgrade operations. See
[user stories](#user-stories) for additional insight.

`Block For Eviction If Contains Last Replica` is a more efficient behavior that may make sense as the long-term setting
in some clusters. It has largely the same benefits and drawbacks as `Block If Contains Last Replica`, except that it
doesn't require users to perform any evictions manually (through the UI). Still, when data is properly backed up and
the user is planning to bring a node back online after maintenance, it is costlier than other options.

### Related Issues

https://github.com/longhorn/longhorn/issues/2238

### API changes

Add `block-for-eviction` and `block-for-eviction-if-contains-last-replica` options to the `node-drain-policy` setting.
The user chooses these options to opt in to the new behaviors.
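
As a hedged sketch (constant names are assumptions patterned after the existing setting values, not the actual
implementation), the setting's allowed values would grow from three to five:

```go
package types

// Allowed values for the node-drain-policy setting. The first three already
// exist in Longhorn; the last two are the options proposed above. The constant
// names are illustrative assumptions.
const (
	NodeDrainPolicyBlockIfContainsLastReplica            = "block-if-contains-last-replica"
	NodeDrainPolicyAllowIfReplicaIsStopped               = "allow-if-replica-is-stopped"
	NodeDrainPolicyAlwaysAllow                           = "always-allow"
	NodeDrainPolicyBlockForEviction                      = "block-for-eviction"
	NodeDrainPolicyBlockForEvictionIfContainsLastReplica = "block-for-eviction-if-contains-last-replica"
)
```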

Add a `status.autoEvicting` field to the `node.longhorn.io/v1beta2` custom resource. This is not a field users
can/should interact with, but they can view it via kubectl.

It would also be possible to represent this state with a node condition with the type `Evicting`. However, this was a
bit less natural, because:
- During normal operation, `Evicting` should be `False`. The Longhorn UI displays a condition in this state with a red
symbol, indicating an error state that should be investigated.

Deprecate the `replica.status.evictionRequested` field in favor of a new `replica.spec.evictionRequested` field so that
replica eviction can be requested by the node controller instead of the replica controller.
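
A minimal sketch of the resulting Go API types, assuming a layout patterned after the existing v1beta2 definitions
(struct names and everything outside the two new fields and the deprecation are assumptions):

```go
package longhorn

// Illustrative only; the field names come from the proposal above, everything
// else is an assumption about how the v1beta2 Go types are laid out.

type NodeStatus struct {
	// ... existing status fields elided ...

	// AutoEvicting reports that replicas are being evicted from this node
	// automatically because of the node drain policy.
	AutoEvicting bool `json:"autoEvicting"`
}

type ReplicaSpec struct {
	// ... existing spec fields elided ...

	// EvictionRequested is set by the node controller to ask the volume
	// controller to evict this replica.
	EvictionRequested bool `json:"evictionRequested"`
}

type ReplicaStatus struct {
	// ... existing status fields elided ...

	// Deprecated: eviction is now requested through spec.evictionRequested.
	EvictionRequested bool `json:"evictionRequested"`
}
```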

## Design

### Implementation Overview

The existing eviction logic is well-tested, so there is no reason to significantly refactor it. It works as follows (a
simplified sketch follows the list):

- The user can set `spec.evictionRequested = true` on a node or disk.
- When the replica controller sees `spec.evictionRequested == true` on the node or disk hosting a replica, it sets
  `status.evictionRequested = true` on that replica.
- The volume controller uses `replica.status.evictionRequested == true` to influence replica scheduling/deletion
  behavior (e.g. rebuild an extra replica to replace the evicting one or delete the evicting one once rebuilding is
  complete).
- The user can cancel an eviction by setting `spec.evictionRequested = false` on the node or disk, which causes the
  replica controller to set `status.evictionRequested = false` on the affected replicas.
- NOTE: If a new replica already started rebuilding as part of an eviction, it continues to rebuild and remains in the
cluster even after eviction is canceled. It can be cleaned up manually if desired.
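
As a point of comparison with the new logic below, a simplified sketch of the existing rule (not the actual
longhorn-manager code; the function and its parameters are illustrative):

```go
package controller

// existingEvictionRequested mirrors the current replica controller behavior:
// a replica requests eviction while its node or its disk asks for it, and the
// request is cleared as soon as neither does.
func existingEvictionRequested(nodeEvictionRequested, diskEvictionRequested bool) bool {
	return nodeEvictionRequested || diskEvictionRequested
}
```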

Make changes (sketched in Go after this list) so that:

- The node controller (not the replica controller) sets `replica.spec.evictionRequested = true` when:
  - `spec.evictionRequested == true` on the replica's node (similar to existing behavior moved from the replica
    controller), OR
  - `spec.evictionRequested == true` on the replica's disk (similar to existing behavior moved from the replica
    controller), OR
  - `status.Unschedulable == true` on the associated Kubernetes node object and the node drain policy is
    `block-for-eviction`, OR
  - `status.Unschedulable == true` on the associated Kubernetes node object, the node drain policy is
    `block-for-eviction-if-contains-last-replica`, and there are no other PDB-protected replicas for the replica's
    volume.
- Much of the logic currently used by the instance manager controller to recognize PDB-protected replicas is moved to
utility functions so both the node and instance manager controllers can use it.
- The volume controller uses `replica.spec.evictionRequested == true` in exactly the same way it previously used
`replica.status.evictionRequested` to influence replica scheduling/deletion behavior (e.g. rebuild an extra replica
to replace the evicting one or delete the evicting one once rebuilding is complete).
- The node controller sets `replica.spec.evictionRequested = false` when all of the following are true:
  - `spec.evictionRequested == false` on the replica's node, AND
  - `spec.evictionRequested == false` on the replica's disk, AND
  - one of the following holds:
    - `status.Unschedulable == false` on the associated Kubernetes node object, OR
    - `status.Unschedulable == true` on the associated Kubernetes node object and the node drain policy is not
      `block-for-eviction` or `block-for-eviction-if-contains-last-replica`, OR
    - `status.Unschedulable == true` on the associated Kubernetes node object, the node drain policy is
      `block-for-eviction-if-contains-last-replica`, and there is another PDB-protected replica for the replica's
      volume.
- The node controller sets `status.autoEvicting = true` when a node has evicting replicas because of the new drain
policies and `status.autoEvicting == false` when it does not. This provides a clue to the user (and the UI) while auto
eviction is ongoing.
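
A rough Go sketch of the combined decision described above. Every type, field, and helper name here is an assumption
for illustration; only the policy values and the `spec`/`status` fields they feed come from this proposal:

```go
package controller

// Node drain policy values relevant to automatic eviction.
const (
	blockForEviction                      = "block-for-eviction"
	blockForEvictionIfContainsLastReplica = "block-for-eviction-if-contains-last-replica"
)

// replicaEvictionInputs gathers everything the node controller needs to decide
// whether a single replica should be evicted. The field names are assumptions.
type replicaEvictionInputs struct {
	nodeEvictionRequested    bool // spec.evictionRequested on the replica's node.longhorn.io object
	diskEvictionRequested    bool // spec.evictionRequested on the replica's disk
	kubeNodeUnschedulable    bool // status.unschedulable on the associated Kubernetes node object
	hasOtherProtectedReplica bool // another healthy, PDB-protected replica exists for the volume
}

// desiredEvictionRequested returns the value the node controller should write
// to replica.spec.evictionRequested.
func desiredEvictionRequested(policy string, in replicaEvictionInputs) bool {
	// Manual eviction via the node or disk works exactly as it does today.
	if in.nodeEvictionRequested || in.diskEvictionRequested {
		return true
	}
	// Automatic eviction only applies while the Kubernetes node is cordoned.
	if !in.kubeNodeUnschedulable {
		return false
	}
	switch policy {
	case blockForEviction:
		// Evict every replica from the cordoned node.
		return true
	case blockForEvictionIfContainsLastReplica:
		// Evict only replicas that are the last PDB-protected replica of their volume.
		return !in.hasOtherProtectedReplica
	default:
		// Other policies never trigger automatic eviction.
		return false
	}
}
```

Under the same assumptions, the node controller would set `status.autoEvicting = true` on the `node.longhorn.io` object
whenever this function returns `true` for one of the node's replicas because of one of the two new policies, and
`false` otherwise.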

### Test plan

Test normal behavior with `block-for-eviction`:

- Create a volume.
- Ensure (through soft anti-affinity, low replica count, and/or enough disks) that an evicted replica of the volume can
  be scheduled elsewhere.
- Write data to the volume.
- Drain a node the volume has a replica scheduled to.
- While the drain is ongoing:
  - Verify that the volume never becomes degraded.
  - Verify that `node.status.autoEvicting == true`.
  - Verify that `replica.spec.evictionRequested == true`.
- Verify the drain completes.
- Uncordon the node.
- Verify the replica on the drained node has moved to a different one.
- Verify that `node.status.autoEvicting == false`.
- Verify that `replica.spec.evictionRequested == false`.
- Verify the volume's data.

Test normal behavior with `block-for-eviction-if-contains-last-replica`:

- Create one volume with a single replica and another volume with three replicas.
- Ensure (through soft anti-affinity, low replica count, and/or enough disks) that evicted replicas of both volumes can
be scheduled elsewhere.
- Write data to the volumes.
- Drain a node both volumes have a replica scheduled to.
- While the drain is ongoing:
  - Verify that the volume with one replica never becomes degraded.
  - Verify that the volume with three replicas becomes degraded.
  - Verify that `node.status.autoEvicting == true`.
  - Verify that `replica.spec.evictionRequested == true` on the replica for the volume that only has one.
  - Verify that `replica.spec.evictionRequested == false` on the replica for the volume that has three.
- Verify the drain completes.
- Uncordon the node.
- Verify the replica for the volume with one replica has moved to a different node.
- Verify the replica for the volume with three replicas has not moved.
- Verify that `node.status.autoEvicting == false`.
- Verify that `replica.spec.evictionRequested == false` on all replicas.
- Verify the data in both volumes.

Test unschedulable behavior with `block-for-eviction`:

- Create a volume.
- Ensure (through soft anti-affinity, high replica count, and/or not enough disks) that an evicted replica of the volume
  cannot be scheduled elsewhere.

### Upgrade strategy

- Add `status.autoEvicting = false` to all `node.longhorn.io` objects during the upgrade.
- Add `spec.evictionRequested = status.evictionRequested` to all replica objects during the upgrade (see the sketch
  below).
- The default node drain policy remains `Block If Contains Last Replica`, so do not make setting changes.
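
A hedged sketch of this migration, using hypothetical simplified object shapes rather than the real Longhorn upgrade
framework:

```go
package upgrade

// Hypothetical, simplified shapes of the objects touched during the upgrade.
type replica struct {
	Spec   struct{ EvictionRequested bool }
	Status struct{ EvictionRequested bool }
}

type node struct {
	Status struct{ AutoEvicting bool }
}

// migrateEvictionFields copies the deprecated status field into the new spec
// field and initializes the new node status field.
func migrateEvictionFields(replicas []*replica, nodes []*node) {
	for _, r := range replicas {
		// Preserve evictions that were already requested before the upgrade.
		r.Spec.EvictionRequested = r.Status.EvictionRequested
	}
	for _, n := range nodes {
		// No automatic eviction is in progress immediately after the upgrade.
		n.Status.AutoEvicting = false
	}
}
```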

## Note
