Add proposal for temporary preservation of Failed machines for diagnostics #1031
# Preservation of Failed Machines

<!-- TOC -->

- [Preservation of Failed Machines](#preservation-of-failed-machines)
  - [Objective](#objective)
  - [Proposal](#proposal)
  - [State Machine](#state-machine)
  - [Use Cases](#use-cases)
  - [Open Point](#open-point)
  - [Limitations](#limitations)

<!-- /TOC -->

## Objective

Currently, the Machine Controller Manager (MCM) moves Machines with errors to the `Unknown` phase and, after the configured `machineHealthTimeout`, to the `Failed` phase.
`Failed` machines are swiftly moved to the `Terminating` phase, during which the node is drained and the `Machine` object is deleted. This rapid cleanup prevents SREs, operators, and support from analysing the VM and makes finding the root cause of the failure more difficult.

This document proposes enhancing MCM such that:
* VMs of `Failed` machines are retained temporarily for analysis
* There is a configurable limit to the number of `Failed` machines that can be preserved
  > **Review comment:** FYI, it's rare that many nodes need to be investigated. Since configuration is pretty standard across all nodes of the same worker group, investigating one node is usually enough. That being said, I don't have anything against this protection!
* There is a configurable limit to the duration for which such machines are preserved
* Users can specify which healthy machines they would like to preserve in case of failure
* Users can request MCM to release a preserved `Failed` machine before the timeout expires, so that MCM can transition the machine to the `Terminating` phase and trigger its deletion.

## Proposal

In order to achieve the objectives mentioned above, the following are proposed:
1. Enhance the `machineControllerManager` configuration in the `ShootSpec` to specify the maximum number of failed machines to be preserved and the duration for which these machines will be preserved:
   ```yaml
   machineControllerManager:
     failedMachinePreserveMax: 2
     failedMachinePreserveTimeout: 3h
   ```
   > **Review comment:** I'd put the default of that to 48 or 72 hours. We require expertise from around the world, and we also have to include weekends. Since this is going to be a shoot-wide setting, setting it to 3h would in many cases render this feature useless, since we can't really change settings in the shoot YAML without the shoot-owner's approval. That being said, if the shoot-owner chooses to set a low amount like […]
   * Since a gardener worker pool can correspond to `1..N` MachineDeployments depending on the number of zones, `failedMachinePreserveMax` will be distributed across the N MachineDeployments (a sketch of this distribution and the capacity check follows this list).
   * `failedMachinePreserveMax` must be chosen such that it can be appropriately distributed across the MachineDeployments.
2. Allow a user/operator to explicitly request preservation of a specific machine with the annotation `node.machine.sapcloud.io/preserve-when-failed=true`, such that, if the machine moves to the `Failed` phase, it is preserved by MCM, provided there is capacity.
   > **Review comment:** Making the annotation be available on both […]
3. MCM will be modified to introduce a new stage in the `Failed` phase: `machineutils.PreserveFailed`. A failed machine that is preserved by MCM will be transitioned to this stage after moving to `Failed`.
   > **Review comment:** Why not call the status […]
4. A machine in the `PreserveFailed` stage automatically moves to the `Terminating` phase once `failedMachinePreserveTimeout` expires.
   * A user/operator can request MCM to stop preserving a machine in the `PreserveFailed` stage using the annotation `node.machine.sapcloud.io/preserve-when-failed=false`.
   * For a machine thus annotated, MCM will move it to the `Terminating` phase even if `failedMachinePreserveTimeout` has not expired.
5. If an un-annotated machine moves to the `Failed` phase and `failedMachinePreserveMax` has not been reached, MCM will auto-preserve this machine.
6. Machines of a MachineDeployment that are in the `PreserveFailed` stage will also be counted towards the replica count and the enforcement of the maximum number of machines allowed for the MachineDeployment.
7. At any point in time, `machines requested for preservation + machines in PreserveFailed <= failedMachinePreserveMax`. If `machines requested for preservation + machines in PreserveFailed` already equals or exceeds `failedMachinePreserveMax` when a machine is annotated, the annotation will be deleted by MCM.
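
To make the capacity bookkeeping of items 1 and 7 concrete, here is a minimal Go sketch. All identifiers (`distributePreserveQuota`, `canPreserve`, `deploymentPreserveState`) are hypothetical and not part of the MCM codebase; how MCM would actually split the quota across zones is a detail of the implementation.

```go
// Illustrative sketch only: distributePreserveQuota, canPreserve and
// deploymentPreserveState are hypothetical names, not MCM code. The sketch
// shows one way failedMachinePreserveMax could be split across the
// MachineDeployments of a worker pool and enforced per deployment.
package main

import "fmt"

// deploymentPreserveState is a hypothetical per-MachineDeployment view of the
// preservation bookkeeping described in items 1 and 7 of the proposal.
type deploymentPreserveState struct {
	Quota     int // this deployment's share of failedMachinePreserveMax
	Requested int // healthy machines annotated with preserve-when-failed=true
	Preserved int // machines currently in the PreserveFailed stage
}

// distributePreserveQuota spreads the pool-wide failedMachinePreserveMax across
// N MachineDeployments (one per zone); earlier deployments absorb the remainder.
// For example, max=2 over 3 zones yields the quotas [1 1 0].
func distributePreserveQuota(max, deployments int) []int {
	quotas := make([]int, deployments)
	for i := 0; i < deployments; i++ {
		quotas[i] = max / deployments
		if i < max%deployments {
			quotas[i]++
		}
	}
	return quotas
}

// canPreserve enforces the invariant from item 7: a new preservation request or
// auto-preservation is accepted only while requested+preserved is below the quota.
func canPreserve(s deploymentPreserveState) bool {
	return s.Requested+s.Preserved < s.Quota
}

func main() {
	fmt.Println(distributePreserveQuota(2, 3))                                // [1 1 0]
	fmt.Println(canPreserve(deploymentPreserveState{Quota: 1}))               // true: capacity free
	fmt.Println(canPreserve(deploymentPreserveState{Quota: 1, Preserved: 1})) // false: annotation would be removed
}
```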

## State Machine

The behaviour described above can be summarised using the state machine below:
```mermaid
stateDiagram-v2
    direction TB
    state "PreserveFailed (node drained)" as PreserveFailed
    state "Requested (node & machine annotated)" as Requested
    [*] --> Running
    Running --> Requested: annotated with value=true && max not breached
    Running --> Running: annotated, but max breached
    Requested --> PreserveFailed: on failure
    Running --> PreserveFailed: on failure && max not breached
    PreserveFailed --> Terminating: after timeout
    PreserveFailed --> Terminating: annotated with value=false
    Running --> Failed: on failure && max breached
    PreserveFailed --> Running: VM recovers
    Failed --> Terminating
    Terminating --> [*]
```

In the above state machine, the phase `Running` also includes machines that are in the process of creation and for which no errors have been encountered yet.
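
The transitions out of `PreserveFailed` depend only on the preserve annotation, the preserve timeout, and whether the VM recovers. The following Go sketch restates that decision under stated assumptions; `preserveStatus` and `nextAction` are illustrative names, not actual MCM code.

```go
// Hypothetical sketch of the decision for a machine in the PreserveFailed
// stage; preserveStatus and nextAction are illustrative names, not MCM code.
package main

import (
	"fmt"
	"time"
)

type preserveStatus struct {
	PreservedSince  time.Time     // when the machine entered PreserveFailed
	PreserveTimeout time.Duration // failedMachinePreserveTimeout
	AnnotationValue string        // node.machine.sapcloud.io/preserve-when-failed, "" if unset
	VMRecovered     bool          // VM health checks pass again
}

// nextAction mirrors the outgoing transitions of the PreserveFailed state above.
func nextAction(now time.Time, s preserveStatus) string {
	switch {
	case s.VMRecovered:
		return "Running" // VM recovers
	case s.AnnotationValue == "false":
		return "Terminating" // early release requested by the operator
	case now.Sub(s.PreservedSince) >= s.PreserveTimeout:
		return "Terminating" // failedMachinePreserveTimeout expired
	default:
		return "PreserveFailed" // keep preserving for diagnostics
	}
}

func main() {
	enteredFourHoursAgo := time.Now().Add(-4 * time.Hour)
	fmt.Println(nextAction(time.Now(), preserveStatus{
		PreservedSince:  enteredFourHoursAgo,
		PreserveTimeout: 3 * time.Hour,
	})) // Terminating
}
```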

## Use Cases

### Use Case 1: Proactive Preservation Request
**Scenario:** An operator suspects a machine might fail and wants to ensure preservation for analysis.
#### Steps:
1. The operator annotates the node with `node.machine.sapcloud.io/preserve-when-failed=true` (see the sketch after these steps), provided `failedMachinePreserveMax` is not exceeded
2. The machine fails later
3. MCM preserves the machine
4. The operator analyzes the failed VM
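
For illustration only: the annotation can be set with `kubectl annotate node <node-name> node.machine.sapcloud.io/preserve-when-failed=true`, or programmatically as in the client-go sketch below (assumes a kubeconfig at the default location; the node name is a placeholder, not from the proposal).

```go
// Minimal client-go sketch (not part of the proposal): patch the
// preserve-when-failed annotation onto a node. Assumes a kubeconfig at the
// default location; "shoot-worker-node-1" is a placeholder node name.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Strategic merge patch that only adds/updates the annotation.
	patch := []byte(`{"metadata":{"annotations":{"node.machine.sapcloud.io/preserve-when-failed":"true"}}}`)
	node, err := clientset.CoreV1().Nodes().Patch(
		context.TODO(), "shoot-worker-node-1", types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("annotated node:", node.Name)
}
```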

### Use Case 2: Automatic Preservation
**Scenario:** A machine fails unexpectedly, with no prior annotation.
#### Steps:
1. The machine transitions to the `Failed` phase
2. If `failedMachinePreserveMax` is not breached, the machine is moved to the `PreserveFailed` stage by MCM
3. After `failedMachinePreserveTimeout`, the machine is terminated by MCM

> **Review comment:** Having many nodes in […] can be risky. Maybe make that available as an option in the shoot YAML (but defaulting to false)?

### Use Case 3: Capacity Management
**Scenario:** Multiple machines fail when preservation capacity is full.
#### Steps:
1. Machines M1 and M2 are already preserved (`failedMachinePreserveMax` = 2)
2. The operator wishes to preserve M3 in case of failure. They increase `failedMachinePreserveMax` to 3 and annotate M3 with `node.machine.sapcloud.io/preserve-when-failed=true`.
3. If M3 fails, the machine is moved to the `PreserveFailed` stage by MCM.

> **Review comment:** FYI, changing values directly in the shoot YAML is not always a possibility for operators (e.g. no permissions to edit the shoot YAML, but admin permission in the cluster). That being said, this is also a valid use case.

### Use Case 4: Early Release
**Scenario:** The operator has completed their analysis and no longer requires the machine to be preserved.

#### Steps:
1. Machine M1 is in the `PreserveFailed` stage
2. The operator adds `node.machine.sapcloud.io/preserve-when-failed=false` to the node.
3. MCM transitions M1 to `Terminating` even though `failedMachinePreserveTimeout` has not expired
4. Capacity becomes available for preserving future `Failed` machines.

## Open Point

How will MCM provide the user with the option to drain a node when it is in the `PreserveFailed` stage?

## Limitations

1. During rolling updates we will NOT honor preserving Machines. A Machine will be replaced with a healthy one if it moves to the `Failed` phase.
2. Since a gardener worker pool can correspond to 1..N MachineDeployments depending on the number of zones, `failedMachinePreserveMax` will need to be distributed across the N MachineDeployments.
So, even if there are no failed machines preserved in other zones, the per-zone maximum would still be enforced. Hence, the value of `failedMachinePreserveMax` should be chosen appropriately.

> **Review comment:** This is confusing to me, but I might not understand the details also... Do you mean that […]

> **Review comment:** TL;DR: I would not restrict this feature to failing nodes.
> E.g. it can happen that we detect problems with essentially all pods on a node, but the node doesn't report any condition failures (i.e. the node/machine will not be in a failed state).
> From an SRE perspective, we want to be as available as possible. Thus, in these kinds of cases, we would cordon/drain the pods (except DaemonSets) and start investigating the node. Furthermore, since expertise is spread around the globe, we sometimes need to keep a node in a cordoned state for 24-28 hours in order to investigate the root cause with the right area's expert. However, if a node is cordoned with no workload on it, it has a very high chance of being scheduled for scale-down by CA first.
> Thus, this feature should also work for non-failing nodes in order to cover all cases.