WIP: Core libraries for two node disruptive tests. #30332

jaypoulz · 2025-10-02T19:15:58Z

No description provided.

openshift-ci · 2025-10-02T19:17:10Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jaypoulz
Once this PR has been reviewed and has the lgtm label, please assign neisw for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

fonta-rh · 2025-10-03T08:47:35Z

test/extended/two_node/utils/hypervisor.go

+
+// VerifyHypervisorConnectivity verifies SSH connectivity to the hypervisor and checks
+// that virsh and libvirt are available.
+func VerifyHypervisorConnectivity(sshConfig *SSHConfig, knownHostsPath string) error {


Would rename to VerifyHypervisor or VerifyHypervisorAvailability (as you're checking for more than connectivity and there is already a VerifyConnectivity function)

fonta-rh · 2025-10-03T09:05:21Z

test/extended/two_node/utils/pacemaker.go

+
+	// SSH to hypervisor, then to surviving node to run pcs debug-start
+	// We need to chain the SSH commands: host -> hypervisor -> surviving node
+	output, stderr, err := PcsCommand(fmt.Sprintf("%s && %s", pcsResourceDebugStop, formatPcsCommandString(pcsResourceDebugStart, pcsResourceDebugStartEnvVars)), sshConfig, localKnownHostsPath, remoteKnownHostsPath, nodeIP)


I think the inconsistency here, between using "formatPcsCommandString" for the second command and calling fmt.Sprintf directly for the first one (in the first parameter), makes this a little harder to read than it should be. Is it worth also encapsulating the first one in a function?

Reading the code further, why aren't we doing this like the PcsDebugStart below, that uses ExecuteRemoteSSHCommand? That's much easier to parse

jaypoulz (reply, Oct 3, 2025): I just removed the utility function and replaced all of the pcs commands with the formatPcsCommandString option for simplicity :)
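For readers following this thread, here is a minimal sketch of what formatting every pcs command through a single helper could look like. It is not the code from the PR: formatPcsCommand and chainCommands are hypothetical names, the "sudo" prefix is an assumption, and the "etcd" resource name and EXAMPLE_VAR=1 assignment in the usage note are placeholders.

```go
package main

import (
	"fmt"
	"strings"
)

// formatPcsCommand is a hypothetical stand-in for the PR's
// formatPcsCommandString. It assumes pcs runs under sudo and that envVars
// is either empty or a space-separated list of KEY=VALUE assignments.
func formatPcsCommand(subcommand, envVars string) string {
	if envVars == "" {
		return fmt.Sprintf("sudo pcs %s", subcommand)
	}
	// sudo accepts VAR=value assignments ahead of the command,
	// subject to the sudoers policy on the target host.
	return fmt.Sprintf("sudo %s pcs %s", envVars, subcommand)
}

// chainCommands joins shell commands with "&&" so that each command
// runs only if the previous one succeeded.
func chainCommands(commands ...string) string {
	return strings.Join(commands, " && ")
}
```

With both halves built by the same helper, the call site reads as chainCommands(formatPcsCommand("resource debug-stop etcd", ""), formatPcsCommand("resource debug-start etcd", "EXAMPLE_VAR=1")), which avoids mixing a raw fmt.Sprintf with a formatted command.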

fonta-rh · 2025-10-03T09:14:57Z

test/extended/two_node/utils/ssh.go

+
+// ExecuteRemoteSSHCommand executes a command on an OpenShift node via two-hop SSH (local → hypervisor → node).
+// Uses 'core' user for the node connection.
+func ExecuteRemoteSSHCommand(nodeIP, command string, sshConfig *SSHConfig, localKnownHostsPath, remoteKnownHostsPath string) (string, string, error) {


I would rename nodeIP to remoteNodeIP, to make it more explicit. Also sshConfig to hypervisorSSHConfig. This way it's easier to know what info each parameter is providing to the function
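To illustrate the suggested parameter names, the sketch below builds the argument list for a two-hop SSH call (local → hypervisor → node). It is an assumption-laden illustration, not the PR's implementation: the SSHConfig fields, shellQuote, and buildTwoHopSSHArgs are hypothetical, and the real ExecuteRemoteSSHCommand may assemble the command differently. Only standard ssh options (-i, -o UserKnownHostsFile) are used.

```go
package main

import (
	"fmt"
	"strings"
)

// SSHConfig is a hypothetical stand-in for the PR's type; the field
// names here are assumptions made for this sketch.
type SSHConfig struct {
	IP             string
	User           string
	PrivateKeyPath string
}

// shellQuote wraps s in single quotes, escaping embedded single quotes,
// so it survives one round of shell parsing on the hypervisor.
func shellQuote(s string) string {
	return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
}

// buildTwoHopSSHArgs returns the arguments for the local ssh invocation.
// The last argument is the inner ssh command that the hypervisor runs to
// reach the node as the 'core' user.
func buildTwoHopSSHArgs(remoteNodeIP, command string, hypervisorSSHConfig *SSHConfig,
	localKnownHostsPath, remoteKnownHostsPath string) []string {
	inner := fmt.Sprintf("ssh -o UserKnownHostsFile=%s core@%s %s",
		shellQuote(remoteKnownHostsPath), remoteNodeIP, shellQuote(command))
	return []string{
		"-i", hypervisorSSHConfig.PrivateKeyPath,
		"-o", "UserKnownHostsFile=" + localKnownHostsPath,
		fmt.Sprintf("%s@%s", hypervisorSSHConfig.User, hypervisorSSHConfig.IP),
		inner,
	}
}
```

With remoteNodeIP and hypervisorSSHConfig named this way, it is visible at the call site which address belongs to the first hop and which to the second.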

openshift-trt · 2025-10-03T21:22:30Z

Job Failure Risk Analysis for sha: ef0ff06

Job Name	Failure Risk
pull-ci-openshift-origin-main-e2e-aws-ovn-single-node-upgrade	IncompleteTests Tests for this run (32) are below the historical average (3677): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-main-okd-scos-e2e-aws-ovn	IncompleteTests Tests for this run (140) are below the historical average (1532): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)

openshift-trt · 2025-10-08T04:54:18Z

Job Failure Risk Analysis for sha: 74d8539

Job Name	Failure Risk
pull-ci-openshift-origin-main-e2e-aws-ovn-single-node-upgrade	IncompleteTests

This commit introduces infrastructure for testing two-node OpenShift cluster disruption scenarios via hypervisor operations, enabling tests that replace or recover control plane nodes through VM management.

New utility libraries (test/extended/two_node/utils/):

Core utilities:
- file.go: Temp file creation, template processing, resource backup
- retry.go: Configurable retry logic and polling with timeouts
- ssh.go: Direct and two-hop SSH (local→hypervisor→node)
- validation.go: Input validation and security checks
- etcd.go: Error classification, job management, polling
- hypervisor.go: Connectivity verification and config helpers
- libvirt.go: VM lifecycle via virsh (define/start/stop/destroy)
- pacemaker.go: Cluster operations (node add/remove, status)

Framework integration:
- Added HypervisorConfig to cluster discovery
- New --with-hypervisor-json flag for SSH configuration
- Added [Requires:HypervisorSSHConfig] skip annotation
- Test helpers: GetHypervisorConfig(), HasHypervisorConfig()

Key features: Two-hop SSH support, VM management, etcd/Pacemaker control, security-focused validation, intelligent retry logic.
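As a rough illustration of the "configurable retry logic" item, a bounded retry helper might look like the sketch below. RetryOptions and Retry are hypothetical names chosen for this example; the actual API in retry.go is not shown in this conversation and may differ.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// RetryOptions is a hypothetical configuration type for this sketch.
type RetryOptions struct {
	MaxAttempts int           // total number of tries, must be >= 1
	Delay       time.Duration // fixed pause between failed tries
}

// Retry calls op until it returns nil or MaxAttempts is exhausted.
// The last error is wrapped so callers can inspect it with errors.Is/As.
func Retry(opts RetryOptions, op func() error) error {
	if opts.MaxAttempts < 1 {
		return errors.New("MaxAttempts must be at least 1")
	}
	var lastErr error
	for attempt := 1; attempt <= opts.MaxAttempts; attempt++ {
		if lastErr = op(); lastErr == nil {
			return nil
		}
		// Do not sleep after the final attempt.
		if attempt < opts.MaxAttempts {
			time.Sleep(opts.Delay)
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", opts.MaxAttempts, lastErr)
}
```

A disruptive test could wrap a flaky step, such as an SSH command issued while a node is still rebooting, in Retry(RetryOptions{MaxAttempts: 5, Delay: 10 * time.Second}, step) so that transient failures do not fail the test outright.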

openshift-ci · 2025-10-08T18:58:01Z

@jaypoulz: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-aws-ovn-single-node-upgrade	`74d8539`	link	false	`/test e2e-aws-ovn-single-node-upgrade`
ci/prow/e2e-gcp-ovn-upgrade	`64d4f0e`	link	true	`/test e2e-gcp-ovn-upgrade`
ci/prow/e2e-aws-ovn-single-node	`74d8539`	link	false	`/test e2e-aws-ovn-single-node`
ci/prow/e2e-openstack-ovn	`74d8539`	link	false	`/test e2e-openstack-ovn`
ci/prow/e2e-aws-ovn-single-node-serial	`74d8539`	link	false	`/test e2e-aws-ovn-single-node-serial`
ci/prow/okd-scos-e2e-aws-ovn	`64d4f0e`	link	false	`/test okd-scos-e2e-aws-ovn`
ci/prow/e2e-gcp-ovn	`64d4f0e`	link	true	`/test e2e-gcp-ovn`

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) Oct 2, 2025

openshift-ci bot requested review from jeff-roche and sjenning October 2, 2025 19:17

fonta-rh reviewed Oct 3, 2025


jaypoulz force-pushed the two-node-disruption-test-libs branch 5 times, most recently from 323411e to ef0ff06, October 3, 2025 15:38

jaypoulz force-pushed the two-node-disruption-test-libs branch from ef0ff06 to 74d8539, October 8, 2025 00:33

jaypoulz force-pushed the two-node-disruption-test-libs branch from 74d8539 to a97ec55, October 8, 2025 13:12

jaypoulz force-pushed the two-node-disruption-test-libs branch from a97ec55 to 64d4f0e, October 8, 2025 14:56
