OCPEDGE-1565: [TNF] Add double node failure recovery test #30370

clobrano · 2025-10-10T14:44:57Z

This PR introduces a new test case that validates etcd recovery after parallel failure and restart of both nodes in a two-node OpenShift cluster.

The failure is simulated via helper functions that leverage libvirt/virsh APIs to manage VM lifecycle (ungraceful shutdown, state verification, restart).
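For readers unfamiliar with virsh, the three lifecycle steps named above map onto standard subcommands. The sketch below is illustrative only: `virshArgs` and its action labels are hypothetical names, not the helpers added in this PR, and it only builds the argument list rather than running anything on a hypervisor.

```go
package main

import "fmt"

// virshArgs builds (but does not run) the virsh command line for one lifecycle step.
// The function and its action labels are hypothetical; the subcommands are standard virsh ones.
func virshArgs(action, vm string) ([]string, error) {
	switch action {
	case "force-off":
		// `virsh destroy` powers the domain off immediately, without a guest shutdown.
		return []string{"virsh", "destroy", vm}, nil
	case "state":
		// `virsh domstate` prints the state, e.g. "running" or "shut off".
		return []string{"virsh", "domstate", vm}, nil
	case "start":
		return []string{"virsh", "start", vm}, nil
	}
	return nil, fmt.Errorf("unknown action %q", action)
}

func main() {
	for _, action := range []string{"force-off", "state", "start"} {
		args, err := virshArgs(action, "master-0")
		fmt.Println(args, err)
	}
}
```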

This commit introduces infrastructure for testing two-node OpenShift cluster disruption scenarios via hypervisor operations, enabling tests that replace or recover control plane nodes through VM management.

New utility libraries (test/extended/two_node/utils/):

Core utilities:
- file.go: Temp file creation, template processing, resource backup
- retry.go: Configurable retry logic and polling with timeouts
- ssh.go: Direct and two-hop SSH (local→hypervisor→node)
- validation.go: Input validation and security checks
- etcd.go: Error classification, job management, polling
- hypervisor.go: Connectivity verification and config helpers
- libvirt.go: VM lifecycle via virsh (define/start/stop/destroy)
- pacemaker.go: Cluster operations (node add/remove, status)

Framework integration:
- Added HypervisorConfig to cluster discovery
- New --with-hypervisor-json flag for SSH configuration
- Added [Requires:HypervisorSSHConfig] skip annotation
- Test helpers: GetHypervisorConfig(), HasHypervisorConfig()

Key features: Two-hop SSH support, VM management, etcd/Pacemaker control, security-focused validation, intelligent retry logic.

openshift-ci-robot · 2025-10-10T14:45:02Z

@clobrano: This pull request references OCPEDGE-1565 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

openshift-ci · 2025-10-10T14:45:03Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

jaypoulz · 2025-10-10T19:13:03Z

test/extended/two_node/tnf_recovery.go

+	c.HypervisorConfig.PrivateKeyPath = sshConfig.PrivateKeyPath
+
+	// Validate that the private key file exists
+	if _, err = os.Stat(c.HypervisorConfig.PrivateKeyPath); os.IsNotExist(err) {


Should these setup errors be returned so we can have an "expect no errors" check after the setup is called?

clobrano · 2025-10-11

This is indeed returned as a named return value, so the next question is probably "why use named return values?" 😄. This function returns four values, and I thought it'd be too verbose to write them all out for every early return, but I'm not totally committed to that choice.

jaypoulz · 2025-10-13

Ah - this is "new" syntax to me. I see where the variable is declared now, so I understand your intent. No preference from me! :)

jaypoulz · 2025-10-10T19:16:05Z

test/extended/two_node/tnf_recovery.go

+
+// findVMByNodeName finds a VM that corresponds to an OpenShift node
+// This uses a simple name-based correlation approach
+func findVMByNodeName(nodeName string, sshConfig *core.SSHConfig, knownHostsPath string) (string, error) {


Probably a common enough utility that we can add it to the shared utilities. I would throw it in the libvirt library since it's an extension of the virsh list call.

clobrano · 2025-10-11

Good point. Speaking of those utilities, I kept WaitForVMToStart to avoid breaking any existing code in your tests, but since I introduced WaitForVMState, it can totally replace the first one. Would you mind if I just removed WaitForVMToStart?

jaypoulz · 2025-10-13

I would say - keep the base commit the same for consistency, but then add the updated function & removal as part of your test commit. If this merges first, I'll update my test to leverage it and remove the extras. If this merges second, you'll have already done so in your commit that adds this test.
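To make the "simple name-based correlation" concrete, here is one way such a lookup could work. This is a sketch under assumptions: `matchVMByNodeName` is a hypothetical name, the input is assumed to be the output of `virsh list --all --name`, and the substring rule is a guess at the matching logic rather than what `findVMByNodeName` actually does.

```go
package main

import (
	"fmt"
	"strings"
)

// matchVMByNodeName returns the first VM whose name contains the node's short hostname.
// virshNames is assumed to hold the output of `virsh list --all --name` (one name per line).
// The substring rule is an assumption, not necessarily the PR's logic.
func matchVMByNodeName(nodeName, virshNames string) (string, error) {
	short, _, _ := strings.Cut(nodeName, ".") // drop any domain suffix
	if short == "" {
		return "", fmt.Errorf("empty node name")
	}
	for _, line := range strings.Split(virshNames, "\n") {
		vm := strings.TrimSpace(line)
		if vm != "" && strings.Contains(vm, short) {
			return vm, nil
		}
	}
	return "", fmt.Errorf("no VM found for node %q", nodeName)
}

func main() {
	out := "tnf-master-0\ntnf-master-1\n\n"
	vm, err := matchVMByNodeName("master-1.example.com", out)
	fmt.Println(vm, err)
}
```

Keeping the parsing separate from the SSH call makes the helper easy to unit test, which fits the suggestion to move it into the shared libvirt utilities.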

jaypoulz

Just a few observations. Looks good to me!

clobrano · 2025-10-11T03:52:07Z

/retest-required

This commit introduces a new test case that validates etcd recovery after parallel failure and restart of both nodes in a two-node OpenShift cluster. The test follows a sequential flow:

1. Stop both VMs and verify they reach shut off state
2. Restart both VMs and verify they reach running state
3. Validate both etcd members recover to healthy, voting member state

The functionality of `WaitForVMToStart` overlaps with the broader `WaitForVMState` function

jaypoulz · 2025-10-13T15:09:34Z

/approve

jaypoulz · 2025-10-13T15:09:52Z

/lgtm

openshift-ci · 2025-10-13T15:09:56Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: clobrano, jaypoulz
Once this PR has been reviewed and has the lgtm label, please assign dennisperiquet for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

jaypoulz · 2025-10-13T15:10:08Z

We'll need an approval from #forum-ocp-release-oversight

clobrano · 2025-10-13T18:40:52Z

/retest-required

openshift-ci · 2025-10-13T19:52:09Z

@clobrano: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/okd-scos-e2e-aws-ovn	`0c378ae`	link	false	`/test okd-scos-e2e-aws-ovn`
ci/prow/verify	`0c378ae`	link	true	`/test verify`

openshift-ci bot added the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress.) Oct 10, 2025

openshift-ci-robot added the jira/valid-reference label (Indicates that this PR references a valid Jira ticket of any type.) Oct 10, 2025

clobrano marked this pull request as ready for review October 10, 2025 17:05

openshift-ci bot removed the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress.) Oct 10, 2025

openshift-ci bot requested review from eggfoobar and p0lyn0mial October 10, 2025 17:06

jaypoulz reviewed Oct 10, 2025

jaypoulz approved these changes Oct 10, 2025

clobrano force-pushed the tnf-e2e-double-node-failure branch from 177f0bd to 54fcd00 Compare October 13, 2025 12:20

two_node/libvirt.go: Remove Redundant WaitForVMToStart

0c378ae

The functionality of `WaitForVMToStart` overlaps with the broader `WaitForVMState` function

openshift-ci bot assigned jaypoulz Oct 13, 2025

openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged.) Oct 13, 2025
