OCPEDGE-1565: [TNF] Add double node failure recovery test #30370
This commit introduces infrastructure for testing two-node OpenShift cluster disruption scenarios via hypervisor operations, enabling tests that replace or recover control plane nodes through VM management.

New utility libraries (test/extended/two_node/utils/):

Core utilities:
- file.go: Temp file creation, template processing, resource backup
- retry.go: Configurable retry logic and polling with timeouts
- ssh.go: Direct and two-hop SSH (local→hypervisor→node)
- validation.go: Input validation and security checks
- etcd.go: Error classification, job management, polling
- hypervisor.go: Connectivity verification and config helpers
- libvirt.go: VM lifecycle via virsh (define/start/stop/destroy)
- pacemaker.go: Cluster operations (node add/remove, status)

Framework integration:
- Added HypervisorConfig to cluster discovery
- New --with-hypervisor-json flag for SSH configuration
- Added [Requires:HypervisorSSHConfig] skip annotation
- Test helpers: GetHypervisorConfig(), HasHypervisorConfig()

Key features: Two-hop SSH support, VM management, etcd/Pacemaker control, security-focused validation, intelligent retry logic.
@clobrano: This pull request references OCPEDGE-1565 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.
Skipping CI for Draft Pull Request.
c.HypervisorConfig.PrivateKeyPath = sshConfig.PrivateKeyPath

// Validate that the private key file exists
if _, err = os.Stat(c.HypervisorConfig.PrivateKeyPath); os.IsNotExist(err) {
Should these setup errors be returned so we can have an "expect no errors" check after the setup is called?
This is indeed returned as a named return value, so the next question is probably "why use named return values?" 😄. This function returns four values, and I thought it'd be too verbose to write them all out for every early return, but I'm not totally committed to that choice.
Ah - this is "new" syntax to me. I see where the variable is declared now so I now understand your intent. No preference from me! :)
// findVMByNodeName finds a VM that corresponds to an OpenShift node
// This uses a simple name-based correlation approach
func findVMByNodeName(nodeName string, sshConfig *core.SSHConfig, knownHostsPath string) (string, error) {
Probably a common enough utility that we can add it to the shared utilities. I would throw it in the libvirt library since it's an extension of the `virsh list` call.
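As a rough illustration of name-based correlation, the sketch below matches a node name against the output of `virsh list --all --name`. The body of `findVMByNodeName` is not shown in this thread, so the matching rule here (the VM name contains the node's short hostname) is an assumption, and `matchVMByNodeName` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

// matchVMByNodeName is a hypothetical, pure helper: it takes the text
// printed by `virsh list --all --name` (one VM name per line) and returns
// the single VM whose name contains the node's short hostname.
// The substring rule is an assumption made for this sketch.
func matchVMByNodeName(nodeName, virshListOutput string) (string, error) {
	short := strings.SplitN(nodeName, ".", 2)[0] // drop any domain suffix
	if short == "" {
		return "", fmt.Errorf("empty node name")
	}

	var matches []string
	for _, line := range strings.Split(virshListOutput, "\n") {
		vm := strings.TrimSpace(line)
		if vm != "" && strings.Contains(vm, short) {
			matches = append(matches, vm)
		}
	}

	switch len(matches) {
	case 0:
		return "", fmt.Errorf("no VM matches node %q", nodeName)
	case 1:
		return matches[0], nil
	default:
		// Refuse to guess when the correlation is ambiguous.
		return "", fmt.Errorf("node %q matches multiple VMs: %v", nodeName, matches)
	}
}

func main() {
	// Example VM names are invented for illustration.
	out := "tnf-master-0\ntnf-master-1\n\n"
	vm, err := matchVMByNodeName("master-0.example.com", out)
	fmt.Println(vm, err)
}
```

Keeping the matching logic separate from the SSH call makes it easy to unit test, which fits the suggestion to move it into the shared libvirt utilities.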
Good point. Speaking of those utilities, I kept `WaitForVMToStart` to avoid breaking any existing code in your tests, but since I introduced `WaitForVMState`, it can completely replace the first one. Would you mind if I just remove `WaitForVMToStart`?
I would say - keep the base commit the same for consistency, but then add the updated function & removal as part of your test commit. If this merges first, I'll update my test to leverage it and remove the extras. If this merges second, you'll have already done so in your commit that adds this test.
Just a few observations. Looks good to me!
/retest-required
This commit introduces a new test case that validates etcd recovery after parallel failure and restart of both nodes in a two-node OpenShift cluster. The test follows a sequential flow:
1. Stop both VMs and verify they reach shut off state
2. Restart both VMs and verify they reach running state
3. Validate both etcd members recover to healthy, voting member state
177f0bd to 54fcd00
The functionality of `WaitForVMToStart` overlaps with the broader `WaitForVMState` function
/approve
/lgtm
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: clobrano, jaypoulz
We'll need an approval from #forum-ocp-release-oversight
/retest-required
@clobrano: The following tests failed, say
This PR introduces a new test case that validates etcd recovery after parallel failure and restart of both nodes in a two-node OpenShift cluster.
The failure is simulated via helper functions that leverage libvirt/virsh APIs to manage VM lifecycle (ungraceful shutdown, state verification, restart).