diff --git a/docs/proposals/failed-machine-preservation.md b/docs/proposals/failed-machine-preservation.md new file mode 100644 index 000000000..e1dde5f67 --- /dev/null +++ b/docs/proposals/failed-machine-preservation.md @@ -0,0 +1,115 @@ +# Preservation of Failed Machines + + + +- [Preservation of Failed Machines](#preservation-of-failed-machines) + - [Objective](#objective) + - [Proposal](#proposal) + - [State Machine](#state-machine) + - [Use Cases](#use-cases) + + + +## Objective + +Currently, the Machine Controller Manager(MCM) moves Machines with errors to the `Unknown` phase, and after the configured `machineHealthTimeout`, to the `Failed` phase. +`Failed` machines are swiftly moved to the `Terminating` phase during which the node is drained and the `Machine` object is deleted. This rapid cleanup prevents SRE/operators/support from conducting an analysis on the VM and makes finding root cause of failure more difficult. + +This document proposes enhancing MCM, such that: +* VMs of `Failed` machines are retained temporarily for analysis +* There is a configurable limit to the number of `Failed` machines that can be preserved +* There is a configurable limit to the duration for which such machines are preserved +* Users can specify which healthy machines they would like to preserve in case of failure +* Users can request MCM to release a preserved `Failed` machine, even before the timeout expires, so that MCM can transition the machine to `Terminating` phase and trigger its deletion. + +## Proposal + +In order to achieve the objectives mentioned, the following are proposed: +1. Enhance `machineControllerManager` configuration in the `ShootSpec`, to specify the max number of failed machines to be preserved, +and the time duration for which these machines will be preserved. + ``` + machineControllerManager: + failedMachinePreserveMax: 2 + failedMachinePreserveTimeout: 3h + ``` + * Since gardener worker pool can correspond to `1..N` MachineDeployments depending on number of zones, `failedMachinePreserveMax` will be distributed across N machine deployments. + * `failedMachinePreserveMax` must be chosen such that it can be appropriately distributed across the MachineDeployments. +2. Allow user/operator to explicitly request for preservation of a specific machine with the use of an annotation : `node.machine.sapcloud.io/preserve-when-failed=true`, such that, if it moves to `Failed` phase, the machine is preserved by MCM, provided there is capacity. +3. MCM will be modified to introduce a new stage in the `Failed` phase: `machineutils.PreserveFailed`, and a failed machine that is preserved by MCM will be transitioned to this stage after moving to `Failed`. +4. A machine in `PreserveFailed` stage automatically moves to `Terminating` phase once `failedMachinePreserveTimeout` expires. + * A user/operator can request MCM to stop preserving a machine in `PreservedFailed` stage using the annotation: `node.machine.sapcloud.io/preserve-when-failed=false`. + * For a machine thus annotated, MCM will move it to `Terminating` phase even if `failedMachinePreserveTimeout` has not expired. +5. If an un-annotated machine moves to `Failed` phase, and the `failedMachinePreserveMax` has not been reached, MCM will auto-preserve this machine. +6. Machines of a MachineDeployment in `PreserveFailed` stage will also be counted towards the replica count and the enforcement of maximum machines allowed for the MachineDeployment. +7. At any point in time `machines requested for preservation + machines in PreservedFailed <= failedMachinePreserveMax`. If `machines requested for preservation + machines in PreservedFailed` is at or exceeds `failedMachinePreserveMax` on annotating a machine, the annotation will be deleted by MCM. + + +## State Machine + +The behaviour described above can be summarised using the state machine below: +```mermaid +stateDiagram + direction TBP + state "PreserveFailed + (node drained)" as PreserveFailed + state "Requested + (node & machine annotated)" + as Requested + [*] --> Running + Running --> Requested:annotated with value=true && max not breached + Running --> Running:annotated, but max breached + Requested --> PreserveFailed:on failure + Running --> PreserveFailed:on failure && max not breached + PreserveFailed --> Terminating:after timeout + PreserveFailed --> Terminating:annotated with value=false + Running --> Failed : on failure && max breached + PreserveFailed --> Running : VM recovers + Failed --> Terminating + Terminating --> [*] + +``` + +In the above state machine, the phase `Running` also includes machines that are in the process of creation for which no errors have been encountered yet. + +## Use Cases: + +### Use Case 1: Proactive Preservation Request +**Scenario:** Operator suspects a machine might fail and wants to ensure preservation for analysis. +#### Steps: +1. Operator annotates node with `node.machine.sapcloud.io/preserve-when-failed=true`, provided `failedMachinePreserveMax` is not violated +2. Machine fails later +3. MCM preserves the machine +4. Operator analyzes the failed VM + +### Use Case 2: Automatic Preservation +**Scenario:** Machine fails unexpectedly, no prior annotation. +#### Steps: +1. Machine transitions to `Failed` phase +2. If `failedMachinePreserveMax` is not breached, machine moved to `PreserveFailed` phase by MCM +3. After `failedMachinePreserveTimeout`, machine is terminated by MCM + +### Use Case 3: Capacity Management +**Scenario:** Multiple machines fail when preservation capacity is full. +#### Steps: +1. Machines M1, M2 already preserved (failedMachinePreserveMax = 2) +2. Operator wishes to preserve M3 in case of failure. He increases `failedMachinePreserveMax` to 3, and annotates M3 with `node.machine.sapcloud.io/preserve-when-failed=true`. +3. If M3 fails, machine moved to `PreserveFailed` phase by MCM. + +### Use Case 4: Early Release +**Scenario:** Operator has performed his analysis and no longer requires machine to be preserved + +#### Steps: +1. Machine M1 is in `PreserveFailed` phase +2. Operator adds: `node.machine.sapcloud.io/preserve-when-failed=false` to node. +3. MCM transitions M1 to `Terminating` even though `failedMachinePreserveTimeout` has not expired +4. Capacity becomes available for preserving future `Failed` machines. + +## Open Point + +How will MCM provide the user with the option to drain a node when it is in `PreserveFailed` stage? + +## Limitations + +1. During rolling updates we will NOT honor preserving Machines. The Machine will be replaced with a healthy one if it moves to Failed phase. +2. Since gardener worker pool can correspond to 1..N MachineDeployments depending on number of zones, we will need to distribute the `failedMachinePreserveMax` across N machine deployments. +So, even if there are no failed machines preserved in other zones, the max per zone would still be enforced. Hence, the value of `failedMachinePreserveMax` should be chosen appropriately.