From 9fbdb30a8a7d6b933a8d2a804ebd161b32b1e42b Mon Sep 17 00:00:00 2001 From: thiyyakat Date: Mon, 22 Sep 2025 15:15:10 +0530 Subject: [PATCH 1/4] Add proposal for preservation of failed machines --- docs/proposals/failed-machine-preservation.md | 103 ++++++++++++++++++ 1 file changed, 103 insertions(+) create mode 100644 docs/proposals/failed-machine-preservation.md diff --git a/docs/proposals/failed-machine-preservation.md b/docs/proposals/failed-machine-preservation.md new file mode 100644 index 000000000..684e59826 --- /dev/null +++ b/docs/proposals/failed-machine-preservation.md @@ -0,0 +1,103 @@ +# Preservation of Failed Machines + + + +- [Preservation of Failed Machines](#preservation-of-failed-machines) + - [Objective](#objective) + - [Solution Design](#solution-design) + - [State Machine](#state-machine) + - [Use Cases](#use-cases) + + + + +## Objective + +Currently, the Machine Controller Manager(MCM) moves Machines with errors to the `Unknown` phase, and after the configured `machineHealthTimeout` seconds, to the `Failed` phase. +`Failed` machines are swiftly moved to the `Terminating` phase during which the node is drained and the `Machine` object is deleted. This rapid cleanup prevents SRE/operators/support from conducting an analysis on the VM and makes finding root cause of failure more difficult. + +This document proposes enhancing MCM, such that: +* VMs of `Failed` machines are retained temporarily for analysis +* There is a configurable limit to the number of `Failed` machines that can be preserved +* There is a configurable limit to the duration for which such machines are preserved +* Users can specify which healthy machines they would like to preserve in case of failure +* Users can request MCM to delete a preserved `Failed` machine, even before the timeout expires + +## Solution Design + +In order to achieve the objectives mentioned, the following are proposed: +1. Enhance `machineControllerManager` configuration in the `ShootSpec`, to specify the max number of failed machines to be preserved, +and the time duration for which these machines will be preserved. + ``` + machineControllerManager: + failedMachinePreserveMax: 2 + failedMachinePreserveTimeout: 3h + ``` + * Since gardener worker pool can correspond to `1..N` MachineDeployments depending on number of zones, `failedMachinePreserveMax` will be distributed across N machine deployments. + * `failedMachinePreserveMax` must be chosen such that it can be appropriately distributed across the MachineDeployments. +2. Allow user/operator to explicitly request for preservation of a machine if it moves to `Failed` phase with the use of an annotation : `node.machine.sapcloud.io/preserve-when-failed=true`. +When such an annotated machine transitions from `Unknown` to `Failed`, it is prevented from moving to `Terminating` phase until `failedMachinePreserveTimeout` expires. + * A user/operator can request MCM to stop preserving a preserved `Failed` machine by adding/modifying the annotation: `node.machine.sapcloud.io/preserve-when-failed=false`. + * For a machine thus annotated, MCM will move it to `Terminating` phase even if `failedMachinePreserveTimeout` has not expired. +3. If an un-annotated machine moves to `Failed` phase, and the `failedMachinePreserveMax` has not been reached, MCM will auto-preserve this machine. +4. MCM will be modified to introduce a new stage in the `Failed` phase: `machineutils.PreserveFailed`, and a failed machine that is preserved by MCM will be transitioned to this stage after moving to `Failed`. + * In this new stage, pods can be evicted and scheduled on other healthy machines, and the user/operator can wait for the corresponding VM to potentially recover. If the machine moves to `Running` phase on recovery, new pods can be scheduled on it. It is yet to be determined whether this feature will be required. + + +## State Machine + +The behaviour described above can be summarised using the state machine below: + +``` +(Running Machine) +├── [User adds `node.machine.sapcloud.io/preserve-when-failed=true`] → (Running + Requested) +└── [Machine fails + capacity available] → (PreserveFailed) + +(Running + Requested) +├── [Machine fails + capacity available] → (PreserveFailed) +├── [Machine fails + no capacity] → Failed → Terminating +└── [User removes `node.machine.sapcloud.io/preserve-when-failed=true`] → (Running) + +(PreserveFailed) +├── [User adds `node.machine.sapcloud.io/preserve-when-failed=false`] → Terminating +└── [failedMachinePreserveTimeout expires] → Terminating + +``` +In the above state machine, the phase `Running` also includes machines that are in the process of creation for which no errors have been encountered yet. +The transition of moving a machine from `PreserveFailed` to `Running` has not been shown since we haven't determined whether it is in scope for the current iteration of this feature. + +## Use Cases: + +### Use Case 1: Proactive Preservation Request +**Scenario:** Operator suspects a machine might fail and wants to ensure preservation for analysis. +#### Steps: +1. Operator annotates node with `node.machine.sapcloud.io/preserve-when-failed=true` +2. Machine fails later +3. MCM preserves the machine (if capacity allows) +4. Operator analyzes the failed VM +5. Operator releases the failed machine by setting `node.machine.sapcloud.io/preserve-when-failed=false` on the node object + +### Use Case 2: Automatic Preservation +**Scenario:** Machine fails unexpectedly, no prior annotation. +#### Steps: +1. Machine transitions to Failed state +2. MCM checks preservation capacity +3. If capacity available, machine moved to `PreserveFailed` phase by MCM +4. After timeout, machine is terminated by MCM + +### Use Case 3: Capacity Management +**Scenario:** Multiple machines fail when preservation capacity is full. +#### Steps: +1. Machines M1, M2 already preserved (capacity = 2) +2. Machine M3 fails with annotation `node.machine.sapcloud.io/preserve-when-failed=true` set +3. MCM cannot preserve M3 due to capacity limits +4. M3 moved from `Failed` to `Terminating` by MCM, following which it is deleted + +### Use Case 4: Early Release +**Scenario:** Operator has performed his analysis and no longer requires machine to be preserved + +#### Steps: +1. Machine M1 is in `PreserveFailed` phase +2. Operator adds: `node.machine.sapcloud.io/preserve-when-failed=false` to node. +3. MCM transitions M1 to `Terminating` +4. Capacity becomes available for preserving future `Failed` machines. From 2286ad7bafbb3aa6e86bd733935a8b8868677ae0 Mon Sep 17 00:00:00 2001 From: thiyyakat Date: Tue, 23 Sep 2025 09:22:55 +0530 Subject: [PATCH 2/4] Add limitations --- docs/proposals/failed-machine-preservation.md | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/docs/proposals/failed-machine-preservation.md b/docs/proposals/failed-machine-preservation.md index 684e59826..5f1798439 100644 --- a/docs/proposals/failed-machine-preservation.md +++ b/docs/proposals/failed-machine-preservation.md @@ -42,6 +42,7 @@ When such an annotated machine transitions from `Unknown` to `Failed`, it is pre 3. If an un-annotated machine moves to `Failed` phase, and the `failedMachinePreserveMax` has not been reached, MCM will auto-preserve this machine. 4. MCM will be modified to introduce a new stage in the `Failed` phase: `machineutils.PreserveFailed`, and a failed machine that is preserved by MCM will be transitioned to this stage after moving to `Failed`. * In this new stage, pods can be evicted and scheduled on other healthy machines, and the user/operator can wait for the corresponding VM to potentially recover. If the machine moves to `Running` phase on recovery, new pods can be scheduled on it. It is yet to be determined whether this feature will be required. +5. Machines of a MachineDeployment in `PreserveFailed` stage will also be counted towards the replica count and the enforcement of maximum machines allowed for the MachineDeployment. ## State Machine @@ -101,3 +102,9 @@ The transition of moving a machine from `PreserveFailed` to `Running` has not be 2. Operator adds: `node.machine.sapcloud.io/preserve-when-failed=false` to node. 3. MCM transitions M1 to `Terminating` 4. Capacity becomes available for preserving future `Failed` machines. + +## Limitations + +1. During rolling updates we will NOT honor preserving Machines. The Machine will be replaced with a healthy one if it moves to Failed phase. +2. Since gardener worker pool can correspond to 1..N MachineDeployments depending on number of zones, we will need to distribute the `failedMachinePreserveMax` across N machine deployments. +So, even if there are no failed machines preserved in other zones, the max per zone would still be enforced. Hence, the value of `failedMachinePreserveMax` should be chosen appropriately. From fc222106e143967dad3a8f8d029904eefc8e8be6 Mon Sep 17 00:00:00 2001 From: thiyyakat Date: Tue, 23 Sep 2025 16:02:38 +0530 Subject: [PATCH 3/4] Address review comments --- docs/proposals/failed-machine-preservation.md | 85 ++++++++++--------- 1 file changed, 47 insertions(+), 38 deletions(-) diff --git a/docs/proposals/failed-machine-preservation.md b/docs/proposals/failed-machine-preservation.md index 5f1798439..30e6788dd 100644 --- a/docs/proposals/failed-machine-preservation.md +++ b/docs/proposals/failed-machine-preservation.md @@ -4,16 +4,15 @@ - [Preservation of Failed Machines](#preservation-of-failed-machines) - [Objective](#objective) - - [Solution Design](#solution-design) + - [Proposal](#proposal) - [State Machine](#state-machine) - [Use Cases](#use-cases) - ## Objective -Currently, the Machine Controller Manager(MCM) moves Machines with errors to the `Unknown` phase, and after the configured `machineHealthTimeout` seconds, to the `Failed` phase. +Currently, the Machine Controller Manager(MCM) moves Machines with errors to the `Unknown` phase, and after the configured `machineHealthTimeout`, to the `Failed` phase. `Failed` machines are swiftly moved to the `Terminating` phase during which the node is drained and the `Machine` object is deleted. This rapid cleanup prevents SRE/operators/support from conducting an analysis on the VM and makes finding root cause of failure more difficult. This document proposes enhancing MCM, such that: @@ -21,9 +20,9 @@ This document proposes enhancing MCM, such that: * There is a configurable limit to the number of `Failed` machines that can be preserved * There is a configurable limit to the duration for which such machines are preserved * Users can specify which healthy machines they would like to preserve in case of failure -* Users can request MCM to delete a preserved `Failed` machine, even before the timeout expires +* Users can request MCM to release a preserved `Failed` machine, even before the timeout expires, so that MCM can transition the machine to `Terminating` phase and trigger its deletion. -## Solution Design +## Proposal In order to achieve the objectives mentioned, the following are proposed: 1. Enhance `machineControllerManager` configuration in the `ShootSpec`, to specify the max number of failed machines to be preserved, @@ -35,64 +34,70 @@ and the time duration for which these machines will be preserved. ``` * Since gardener worker pool can correspond to `1..N` MachineDeployments depending on number of zones, `failedMachinePreserveMax` will be distributed across N machine deployments. * `failedMachinePreserveMax` must be chosen such that it can be appropriately distributed across the MachineDeployments. -2. Allow user/operator to explicitly request for preservation of a machine if it moves to `Failed` phase with the use of an annotation : `node.machine.sapcloud.io/preserve-when-failed=true`. -When such an annotated machine transitions from `Unknown` to `Failed`, it is prevented from moving to `Terminating` phase until `failedMachinePreserveTimeout` expires. - * A user/operator can request MCM to stop preserving a preserved `Failed` machine by adding/modifying the annotation: `node.machine.sapcloud.io/preserve-when-failed=false`. +2. Allow user/operator to explicitly request for preservation of a specific machine with the use of an annotation : `node.machine.sapcloud.io/preserve-when-failed=true`, such that, if it moves to `Failed` phase, the machine is preserved by MCM, provided there is capacity. +3. MCM will be modified to introduce a new stage in the `Failed` phase: `machineutils.PreserveFailed`, and a failed machine that is preserved by MCM will be transitioned to this stage after moving to `Failed`. +4. A machine in `PreserveFailed` stage automatically moves to `Terminating` phase once `failedMachinePreserveTimeout` expires. + * A user/operator can request MCM to stop preserving a machine in `PreservedFailed` stage using the annotation: `node.machine.sapcloud.io/preserve-when-failed=false`. * For a machine thus annotated, MCM will move it to `Terminating` phase even if `failedMachinePreserveTimeout` has not expired. -3. If an un-annotated machine moves to `Failed` phase, and the `failedMachinePreserveMax` has not been reached, MCM will auto-preserve this machine. -4. MCM will be modified to introduce a new stage in the `Failed` phase: `machineutils.PreserveFailed`, and a failed machine that is preserved by MCM will be transitioned to this stage after moving to `Failed`. - * In this new stage, pods can be evicted and scheduled on other healthy machines, and the user/operator can wait for the corresponding VM to potentially recover. If the machine moves to `Running` phase on recovery, new pods can be scheduled on it. It is yet to be determined whether this feature will be required. -5. Machines of a MachineDeployment in `PreserveFailed` stage will also be counted towards the replica count and the enforcement of maximum machines allowed for the MachineDeployment. +5. If an un-annotated machine moves to `Failed` phase, and the `failedMachinePreserveMax` has not been reached, MCM will auto-preserve this machine. +6. Machines of a MachineDeployment in `PreserveFailed` stage will also be counted towards the replica count and the enforcement of maximum machines allowed for the MachineDeployment. +7. At any point in time `machines requested for preservation + machines in PreservedFailed <= failedMachinePreserveMax`. If `machines requested for preservation + machines in PreservedFailed` is at or exceeds `failedMachinePreserveMax` on annotating a machine, the annotation will be deleted by MCM. ## State Machine The behaviour described above can be summarised using the state machine below: +```mermaid +--- +config: + layout: elk +--- +stateDiagram + direction TBP + state "PreserveFailed + (node drained)" as PreserveFailed + state "Requested + (node & machine annotated)" + as Requested + [*] --> Running + Running --> Requested:annotated with value=true && max not breached + Running --> Running:annotated, but max breached + Requested --> PreserveFailed:on failure + Running --> PreserveFailed:on failure && max not breached + PreserveFailed --> Terminating:after timeout + PreserveFailed --> Terminating:annotated with value=false + Running --> Failed : on failure && max breached + PreserveFailed --> Running : VM recovers + Failed --> Terminating + Terminating --> [*] ``` -(Running Machine) -├── [User adds `node.machine.sapcloud.io/preserve-when-failed=true`] → (Running + Requested) -└── [Machine fails + capacity available] → (PreserveFailed) -(Running + Requested) -├── [Machine fails + capacity available] → (PreserveFailed) -├── [Machine fails + no capacity] → Failed → Terminating -└── [User removes `node.machine.sapcloud.io/preserve-when-failed=true`] → (Running) - -(PreserveFailed) -├── [User adds `node.machine.sapcloud.io/preserve-when-failed=false`] → Terminating -└── [failedMachinePreserveTimeout expires] → Terminating - -``` In the above state machine, the phase `Running` also includes machines that are in the process of creation for which no errors have been encountered yet. -The transition of moving a machine from `PreserveFailed` to `Running` has not been shown since we haven't determined whether it is in scope for the current iteration of this feature. ## Use Cases: ### Use Case 1: Proactive Preservation Request **Scenario:** Operator suspects a machine might fail and wants to ensure preservation for analysis. #### Steps: -1. Operator annotates node with `node.machine.sapcloud.io/preserve-when-failed=true` +1. Operator annotates node with `node.machine.sapcloud.io/preserve-when-failed=true`, provided `failedMachinePreserveMax` is not violated 2. Machine fails later -3. MCM preserves the machine (if capacity allows) +3. MCM preserves the machine 4. Operator analyzes the failed VM -5. Operator releases the failed machine by setting `node.machine.sapcloud.io/preserve-when-failed=false` on the node object ### Use Case 2: Automatic Preservation **Scenario:** Machine fails unexpectedly, no prior annotation. #### Steps: -1. Machine transitions to Failed state -2. MCM checks preservation capacity -3. If capacity available, machine moved to `PreserveFailed` phase by MCM -4. After timeout, machine is terminated by MCM +1. Machine transitions to `Failed` phase +2. If `failedMachinePreserveMax` is not breached, machine moved to `PreserveFailed` phase by MCM +3. After `failedMachinePreserveTimeout`, machine is terminated by MCM ### Use Case 3: Capacity Management **Scenario:** Multiple machines fail when preservation capacity is full. #### Steps: -1. Machines M1, M2 already preserved (capacity = 2) -2. Machine M3 fails with annotation `node.machine.sapcloud.io/preserve-when-failed=true` set -3. MCM cannot preserve M3 due to capacity limits -4. M3 moved from `Failed` to `Terminating` by MCM, following which it is deleted +1. Machines M1, M2 already preserved (failedMachinePreserveMax = 2) +2. Operator wishes to preserve M3 in case of failure. He increases `failedMachinePreserveMax` to 3, and annotates M3 with `node.machine.sapcloud.io/preserve-when-failed=true`. +3. If M3 fails, machine moved to `PreserveFailed` phase by MCM. ### Use Case 4: Early Release **Scenario:** Operator has performed his analysis and no longer requires machine to be preserved @@ -100,9 +105,13 @@ The transition of moving a machine from `PreserveFailed` to `Running` has not be #### Steps: 1. Machine M1 is in `PreserveFailed` phase 2. Operator adds: `node.machine.sapcloud.io/preserve-when-failed=false` to node. -3. MCM transitions M1 to `Terminating` +3. MCM transitions M1 to `Terminating` even though `failedMachinePreserveTimeout` has not expired 4. Capacity becomes available for preserving future `Failed` machines. +## Open Point + +How will MCM provide the user with the option to drain a node when it is in `PreserveFailed` stage? + ## Limitations 1. During rolling updates we will NOT honor preserving Machines. The Machine will be replaced with a healthy one if it moves to Failed phase. From e720c5c82e1ff00a7013a087cd885a7c3ec58abd Mon Sep 17 00:00:00 2001 From: thiyyakat Date: Tue, 23 Sep 2025 16:10:41 +0530 Subject: [PATCH 4/4] Change mermaid layout from elk to default for github support --- docs/proposals/failed-machine-preservation.md | 4 ---- 1 file changed, 4 deletions(-) diff --git a/docs/proposals/failed-machine-preservation.md b/docs/proposals/failed-machine-preservation.md index 30e6788dd..e1dde5f67 100644 --- a/docs/proposals/failed-machine-preservation.md +++ b/docs/proposals/failed-machine-preservation.md @@ -48,10 +48,6 @@ and the time duration for which these machines will be preserved. The behaviour described above can be summarised using the state machine below: ```mermaid ---- -config: - layout: elk ---- stateDiagram direction TBP state "PreserveFailed