Add proposal for temporary preservation of `Failed` machines for diagnostics #1031
base: master
Conversation
20d8f8f to 9fbdb30
* There is a configurable limit to the number of `Failed` machines that can be preserved
* There is a configurable limit to the duration for which such machines are preserved
* Users can specify which healthy machines they would like to preserve in case of failure
* Users can request MCM to delete a preserved `Failed` machine, even before the timeout expires
Suggested change:
Replace:
* Users can request MCM to delete a preserved `Failed` machine, even before the timeout expires
with:
* Users can request MCM to release a preserved `Failed` machine, even before the timeout expires, so that MCM can transition the machine to `Terminating` phase and trigger its deletion.
* Users can specify which healthy machines they would like to preserve in case of failure
* Users can request MCM to delete a preserved `Failed` machine, even before the timeout expires

## Solution Design
Suggested change:
Replace:
## Solution Design
with:
## Proposal
* Since a gardener worker pool can correspond to `1..N` MachineDeployments depending on the number of zones, `failedMachinePreserveMax` will be distributed across the N machine deployments.
* `failedMachinePreserveMax` must be chosen such that it can be appropriately distributed across the MachineDeployments.
2. Allow the user/operator to explicitly request preservation of a machine if it moves to the `Failed` phase with the use of an annotation: `node.machine.sapcloud.io/preserve-when-failed=true`.
Review comment:
> Allow user/operator to explicitly request for preservation of a machine if it moves to `Failed` phase with the use of an annotation

If the machine has already moved to the `Failed` state and there is still capacity to preserve it, MCM will automatically preserve it; if there is no capacity, it will swiftly move it to termination and trigger its deletion, giving the user no chance to influence this. I think you meant providing the user an option to preserve specific machines which are not yet in the `Failed` state, right?
Reply:
Yes. Will reword and disambiguate.
* For a machine thus annotated, MCM will move it to `Terminating` phase even if `failedMachinePreserveTimeout` has not expired.
3. If an un-annotated machine moves to `Failed` phase, and the `failedMachinePreserveMax` has not been reached, MCM will auto-preserve this machine.
4. MCM will be modified to introduce a new stage in the `Failed` phase: `machineutils.PreserveFailed`, and a failed machine that is preserved by MCM will be transitioned to this stage after moving to `Failed`.
* In this new stage, pods can be evicted and scheduled on other healthy machines, and the user/operator can wait for the corresponding VM to potentially recover. If the machine moves to `Running` phase on recovery, new pods can be scheduled on it. It is yet to be determined whether this feature will be required.
Review comment:
This needs to be reworded and brought out explicitly as an open question:

Open point: Should MCM provide the consumer with the ability to trigger a drain on the preserved `Failed` machine?
(Running + Requested)
├── [Machine fails + capacity available] → (PreserveFailed)
├── [Machine fails + no capacity] → Failed → Terminating
└── [User removes `node.machine.sapcloud.io/preserve-when-failed=true`] → (Running)
Review comment:
It is not clear why `[User removes node.machine.sapcloud.io/preserve-when-failed=true] → (Running)` is here.
Reply:
For completeness. If the annotation is removed from a healthy machine, the state transitions to (Running) from (Running+Requested), making it a candidate for auto-preservation by MCM on failure, and also allowing users to re-annotate.
2. Machine fails later
3. MCM preserves the machine (if capacity allows)
4. Operator analyzes the failed VM
5. Operator releases the failed machine by setting `node.machine.sapcloud.io/preserve-when-failed=false` on the node object
Review comment:
This case only covers the scenario where the operator suspects a machine might fail and wants to ensure preservation for analysis. Should it also include the explicit release of the preserved `Failed` machine, or is that covered by Use Case 4 (early release)?
Reply:
Feel free to reach out to me directly for precision!
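For illustration, a minimal sketch of the annotation flow in the use case above, assuming the annotation is set directly on the Node object as described; the node name is a placeholder, not part of the proposal:

```yaml
# Request preservation of a suspect (still healthy) node up front
apiVersion: v1
kind: Node
metadata:
  name: worker-a-z1-abcde   # placeholder node name
  annotations:
    node.machine.sapcloud.io/preserve-when-failed: "true"
---
# Later, after analysis, release the preserved machine (step 5)
apiVersion: v1
kind: Node
metadata:
  name: worker-a-z1-abcde
  annotations:
    node.machine.sapcloud.io/preserve-when-failed: "false"
```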
Currently, the Machine Controller Manager (MCM) moves Machines with errors to the `Unknown` phase, and after the configured `machineHealthTimeout`, to the `Failed` phase.
`Failed` machines are swiftly moved to the `Terminating` phase, during which the node is drained and the `Machine` object is deleted. This rapid cleanup prevents SREs/operators/support from conducting an analysis on the VM and makes finding the root cause of failure more difficult.
Review comment:
TL;DR: I would not restrict this feature to failing nodes.

For example, it can happen that we detect problems with essentially all pods on a node, but the node does not report any condition failures (i.e. the node/machine will not be in the `Failed` state).

From an SRE perspective, we want to be as available as possible. Thus, in these kinds of cases, we would cordon/drain the pods (except DaemonSets) and start investigating the node. Furthermore, since expertise is spread around the globe, we sometimes need to keep a node in a cordoned state for 24-28 hours in order to investigate the root cause with the right area's expert. However, if a node is cordoned with no workload on it, it has a very high chance of being scheduled for scale-down by CA first.

Thus, this feature should also work for non-failing nodes in order to cover all cases.
This document proposes enhancing MCM, such that:
* VMs of `Failed` machines are retained temporarily for analysis
* There is a configurable limit to the number of `Failed` machines that can be preserved
Review comment:
FYI, it's rare that many nodes need to be investigated. Since configuration is pretty standard across all nodes of the same worker group, investigating one node is usually enough. That being said, I don't have anything against this protection!
```
machineControllerManager:
  failedMachinePreserveMax: 2
  failedMachinePreserveTimeout: 3h
```
Review comment:
I'd set the default to 48 or 72 hours. We require expertise from around the world, and we also have to include weekends. Since this is going to be a shoot-wide setting, setting it to 3h would in many cases render this feature useless, since we can't really change settings in the shoot YAML without the shoot owner's approval.

That being said, if the shoot owner chooses to set a low value like 3h on purpose, well, then they made the choice to limit support on problematic nodes.
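For context, a sketch of how these settings might surface in a Gardener Shoot worker pool with the reviewer's suggested longer timeout; the exact placement of the new fields is an assumption of this sketch, not something the proposal has fixed:

```yaml
spec:
  provider:
    workers:
    - name: worker-a
      machineControllerManager:
        machineHealthTimeout: 10m          # existing setting, shown for context
        failedMachinePreserveMax: 2
        failedMachinePreserveTimeout: 48h  # reviewer-suggested default; the proposal's example uses 3h
```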
* Since a gardener worker pool can correspond to `1..N` MachineDeployments depending on the number of zones, `failedMachinePreserveMax` will be distributed across the N machine deployments.
* `failedMachinePreserveMax` must be chosen such that it can be appropriately distributed across the MachineDeployments.
2. Allow the user/operator to explicitly request preservation of a specific machine with the use of an annotation: `node.machine.sapcloud.io/preserve-when-failed=true`, such that, if it moves to the `Failed` phase, the machine is preserved by MCM, provided there is capacity.
Review comment:
Making the annotation available on both the `machine` and the `node` object is important IMHO, so that any shoot operator can investigate a failing node by themselves (self-troubleshooting).
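A minimal sketch of what the reviewer suggests, assuming the same annotation key would be honored on both the Node object (in the shoot) and the Machine object (in the seed); names and namespace are placeholders:

```yaml
# On the Node object, as already proposed
apiVersion: v1
kind: Node
metadata:
  name: worker-a-z1-abcde
  annotations:
    node.machine.sapcloud.io/preserve-when-failed: "true"
---
# Additionally on the corresponding Machine object (reviewer's suggestion, not yet part of the proposal)
apiVersion: machine.sapcloud.io/v1alpha1
kind: Machine
metadata:
  name: worker-a-z1-abcde
  namespace: shoot--my-project--my-cluster   # placeholder namespace
  annotations:
    node.machine.sapcloud.io/preserve-when-failed: "true"
```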
* Since a gardener worker pool can correspond to `1..N` MachineDeployments depending on the number of zones, `failedMachinePreserveMax` will be distributed across the N machine deployments.
* `failedMachinePreserveMax` must be chosen such that it can be appropriately distributed across the MachineDeployments.
2. Allow the user/operator to explicitly request preservation of a specific machine with the use of an annotation: `node.machine.sapcloud.io/preserve-when-failed=true`, such that, if it moves to the `Failed` phase, the machine is preserved by MCM, provided there is capacity.
3. MCM will be modified to introduce a new stage in the `Failed` phase: `machineutils.PreserveFailed`, and a failed machine that is preserved by MCM will be transitioned to this stage after moving to `Failed`.
Review comment:
Why not call the status `Preserving` instead of `PreserveFailed`? (As I said above, this would be useful for machines/nodes not detected as failed too.)
### Use Case 2: Automatic Preservation
**Scenario:** Machine fails unexpectedly, no prior annotation.
#### Steps:
1. Machine transitions to `Failed` phase
2. If `failedMachinePreserveMax` is not breached, machine moved to `PreserveFailed` phase by MCM
3. After `failedMachinePreserveTimeout`, machine is terminated by MCM
Review comment:
This risks having many nodes in `PreserveFailed`, mainly if we use a relatively high `failedMachinePreserveTimeout`, even with `failedMachinePreserveMax`, because a failing worker pool could then put multiple nodes into `PreserveFailed`. And depending on how MCM chooses which node to keep/replace in `PreserveFailed`, this could have undesired side effects.

Maybe make automatic preservation available as an option in the shoot YAML (but defaulting to false)?
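A minimal sketch of the opt-in the reviewer suggests; the field name `autoPreserveFailedMachines` is hypothetical and not part of the proposal:

```yaml
machineControllerManager:
  failedMachinePreserveMax: 2
  failedMachinePreserveTimeout: 48h
  autoPreserveFailedMachines: false   # hypothetical flag: when false, only explicitly annotated machines are preserved
```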
### Use Case 3: Capacity Management
**Scenario:** Multiple machines fail when preservation capacity is full.
#### Steps:
1. Machines M1, M2 already preserved (`failedMachinePreserveMax` = 2)
2. Operator wishes to preserve M3 in case of failure. The operator increases `failedMachinePreserveMax` to 3 and annotates M3 with `node.machine.sapcloud.io/preserve-when-failed=true`.
3. If M3 fails, machine moved to `PreserveFailed` phase by MCM.
Review comment:
FYI, changing values directly in the shoot YAML is not always a possibility for operators (e.g. no permission to edit the shoot YAML, but admin permission in the cluster). That being said, this is also a valid use case.
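Sketched concretely, step 2 of this use case involves two actions; the node name is a placeholder and the config stanza mirrors the earlier example:

```yaml
# Raise the preservation budget in the worker pool settings
machineControllerManager:
  failedMachinePreserveMax: 3   # raised from 2
  failedMachinePreserveTimeout: 48h
---
# Mark M3 for preservation via its Node object
apiVersion: v1
kind: Node
metadata:
  name: m3   # placeholder for M3's node name
  annotations:
    node.machine.sapcloud.io/preserve-when-failed: "true"
```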
## Open Point

How will MCM provide the user with the option to drain a node when it is in the `PreserveFailed` stage?
Review comment:
`kubectl drain`? I don't see this as something that should be done by MCM, IMHO. The exact purpose of this feature is to be able to test a node even if it's failing, so workload is sometimes needed in order to troubleshoot. I'd add a warning in the documentation instead, telling the user to do the drain themselves if required.
2. Since a gardener worker pool can correspond to 1..N MachineDeployments depending on the number of zones, we will need to distribute the `failedMachinePreserveMax` across the N machine deployments.
So, even if there are no failed machines preserved in other zones, the max per zone would still be enforced. Hence, the value of `failedMachinePreserveMax` should be chosen appropriately.
Review comment:
This is confusing to me, but I might not understand the details either... Do you mean that `failedMachinePreserveMax` is global to all machines/MachineDeployments, but the maximum amount of nodes in a given MachineDeployment still takes precedence?
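For illustration, one way the per-zone distribution could play out, assuming an even split (the actual distribution algorithm is not specified in the proposal; MachineDeployment names are placeholders):

```yaml
failedMachinePreserveMax: 2   # configured for the whole worker pool "worker-a", which spans 2 zones
# Assumed even split across the pool's two MachineDeployments:
#   shoot--proj--cluster-worker-a-z1 -> may preserve at most 1 Failed machine
#   shoot--proj--cluster-worker-a-z2 -> may preserve at most 1 Failed machine
# Consequence: if two machines fail in z1, only one of them is preserved there,
# even though z2 currently preserves none.
```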
What this PR does / why we need it:
This PR introduces a proposal to support the temporary preservation of `Failed` machines for diagnostic purposes.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Release note: