Skip to content

Conversation

clobrano
Copy link
Contributor

This PR introduces a new test case that validates etcd recovery after parallel failure and restart of both nodes in a two-node OpenShift cluster.

The failure is simulated via helper functions that leverage libvirt/virsh APIs to manage VM lifecycle (ungraceful shutdown, state verification, restart).

This commit introduces infrastructure for testing two-node OpenShift
cluster disruption scenarios via hypervisor operations, enabling tests
that replace or recover control plane nodes through VM management.

New utility libraries (test/extended/two_node/utils/):

Core utilities:
- file.go: Temp file creation, template processing, resource backup
- retry.go: Configurable retry logic and polling with timeouts
- ssh.go: Direct and two-hop SSH (local→hypervisor→node)
- validation.go: Input validation and security checks
- etcd.go: Error classification, job management, polling
- hypervisor.go: Connectivity verification and config helpers
- libvirt.go: VM lifecycle via virsh (define/start/stop/destroy)
- pacemaker.go: Cluster operations (node add/remove, status)

Framework integration:
- Added HypervisorConfig to cluster discovery
- New --with-hypervisor-json flag for SSH configuration
- Added [Requires:HypervisorSSHConfig] skip annotation
- Test helpers: GetHypervisorConfig(), HasHypervisorConfig()

Key features: Two-hop SSH support, VM management, etcd/Pacemaker
control, security-focused validation, intelligent retry logic.
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 10, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Oct 10, 2025
@openshift-ci-robot
Copy link

openshift-ci-robot commented Oct 10, 2025

@clobrano: This pull request references OCPEDGE-1565 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

This PR introduces a new test case that validates etcd recovery after parallel failure and restart of both nodes in a two-node OpenShift cluster.

The failure is simulated via helper functions that leverage libvirt/virsh APIs to manage VM lifecycle (ungraceful shutdown, state verification, restart).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Copy link
Contributor

openshift-ci bot commented Oct 10, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@clobrano clobrano marked this pull request as ready for review October 10, 2025 17:05
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 10, 2025
c.HypervisorConfig.PrivateKeyPath = sshConfig.PrivateKeyPath

// Validate that the private key file exists
if _, err = os.Stat(c.HypervisorConfig.PrivateKeyPath); os.IsNotExist(err) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these setup errors be returned so we can have an "expect no errors" check after the setup is called?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is indeed returned as a named return value, so the next question is probably "why use named return values?" 😄. This function returns four values, and I thought it'd be too verbose to write them all out for every early return, but I'm not totally committed to that choice.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah - this is "new" syntax to me. I see where the variable is declared now so I now understand your intent. No preference from me! :)


// findVMByNodeName finds a VM that corresponds to an OpenShift node
// This uses a simple name-based correlation approach
func findVMByNodeName(nodeName string, sshConfig *core.SSHConfig, knownHostsPath string) (string, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably a common enough utility that we can add it to the shared utilities. I would throw it in the libirt library since it's an extension of virsh list call.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Speaking of that utilities, I kept WaitForVMToStart to avoid breaking any existing code in your tests, but since I introduced WaitForVMState, it can totally replace the first one. Would you mind if I just remove WaitForVMToStart?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say - keep the base commit the same for consistency, but then add the updated function & removal as part of your test commit. If this merges first, I'll update my test to leverage it and remove the extras. If this merges second, you'll have already done so in your commit that adds this test.

Copy link
Contributor

@jaypoulz jaypoulz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few observations. Looks good to me!

@clobrano
Copy link
Contributor Author

/retest-required

This commit introduces a new test case that validates etcd recovery
after parallel failure and restart of both nodes in a two-node
OpenShift cluster.

The test follows a sequential flow:
1. Stop both VMs and verify they reach shut off state
2. Restart both VMs and verify they reach running state
3. Validate both etcd members recover to healthy, voting member state
@clobrano clobrano force-pushed the tnf-e2e-double-node-failure branch from 177f0bd to 54fcd00 Compare October 13, 2025 12:20
The functionality of `WaitForVMToStart` overlaps with the
broader `WaitForVMState` function
@jaypoulz
Copy link
Contributor

/approve

@jaypoulz
Copy link
Contributor

/lgtm

Copy link
Contributor

openshift-ci bot commented Oct 13, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: clobrano, jaypoulz
Once this PR has been reviewed and has the lgtm label, please assign dennisperiquet for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 13, 2025
@jaypoulz
Copy link
Contributor

We'll need an approval from #forum-ocp-release-oversight

@clobrano
Copy link
Contributor Author

/retest-required

Copy link
Contributor

openshift-ci bot commented Oct 13, 2025

@clobrano: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/okd-scos-e2e-aws-ovn 0c378ae link false /test okd-scos-e2e-aws-ovn
ci/prow/verify 0c378ae link true /test verify

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants