jaypoulz
Contributor

@jaypoulz jaypoulz commented Oct 2, 2025

No description provided.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 2, 2025
@openshift-ci openshift-ci bot requested review from jeff-roche and sjenning October 2, 2025 19:17
Contributor

openshift-ci bot commented Oct 2, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jaypoulz
Once this PR has been reviewed and has the lgtm label, please assign neisw for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


// VerifyHypervisorConnectivity verifies SSH connectivity to the hypervisor and checks
// that virsh and libvirt are available.
func VerifyHypervisorConnectivity(sshConfig *SSHConfig, knownHostsPath string) error {

I would rename this to VerifyHypervisor or VerifyHypervisorAvailability, since it checks more than connectivity and there is already a VerifyConnectivity function.
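
To illustrate why the broader name fits, here is a minimal sketch of a check that covers SSH reachability, the virsh binary, and the libvirt connection in one pass. It is not the PR's implementation: the `commandRunner` type, the `fakeRunner` helper, and the exact check commands are assumptions made for this example, and the real function takes an `*SSHConfig` and a known-hosts path instead.

```go
package main

import (
	"errors"
	"fmt"
)

// commandRunner is a hypothetical stand-in for the PR's SSH execution helper:
// it runs one command on the hypervisor and returns its stdout.
type commandRunner func(command string) (string, error)

// VerifyHypervisorAvailability runs each check in order and reports the first failure.
// The check list is an assumption; the PR may verify different things.
func VerifyHypervisorAvailability(run commandRunner) error {
	checks := []struct {
		name    string
		command string
	}{
		{"SSH connectivity", "echo ok"},
		{"virsh binary", "command -v virsh"},
		{"libvirt connection", "virsh version"},
	}
	for _, check := range checks {
		if _, err := run(check.command); err != nil {
			return fmt.Errorf("hypervisor check %q failed: %w", check.name, err)
		}
	}
	return nil
}

// fakeRunner simulates a hypervisor on which one specific command fails.
// Passing an empty string simulates a fully healthy hypervisor.
func fakeRunner(failingCommand string) commandRunner {
	return func(command string) (string, error) {
		if command == failingCommand {
			return "", errors.New("simulated failure")
		}
		return "ok", nil
	}
}

func main() {
	fmt.Println(VerifyHypervisorAvailability(fakeRunner("")))
	fmt.Println(VerifyHypervisorAvailability(fakeRunner("virsh version")))
}
```

Because only the first check is about connectivity, a name like VerifyHypervisorAvailability describes the whole function rather than its first step.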


// SSH to hypervisor, then to surviving node to run pcs debug-start
// We need to chain the SSH commands: host -> hypervisor -> surviving node
output, stderr, err := PcsCommand(fmt.Sprintf("%s && %s", pcsResourceDebugStop, formatPcsCommandString(pcsResourceDebugStart, pcsResourceDebugStartEnvVars)), sshConfig, localKnownHostsPath, remoteKnownHostsPath, nodeIP)

I think the inconsistency here, using formatPcsCommandString for the second command but calling fmt.Sprintf directly for the first (in the first parameter), makes this a little harder to read than it should be. Is it worth also encapsulating the first one in a function?

Reading the code further, why aren't we doing this the way PcsDebugStart below does, using ExecuteRemoteSSHCommand? That is much easier to parse.

Contributor Author

I just removed the utility function and now build all of the pcs commands with formatPcsCommandString, for simplicity :)
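
For readers following the thread, a rough sketch of what building both commands through one formatter could look like. The body of `formatPcsCommandString`, the `chainCommands` helper, and all three constant values are assumptions for illustration; the PR's actual definitions are not shown in this conversation.

```go
package main

import (
	"fmt"
	"strings"
)

// Illustrative values only; the PR's real constants are not visible in this thread.
const (
	pcsResourceDebugStop         = "pcs resource debug-stop etcd"
	pcsResourceDebugStart        = "pcs resource debug-start etcd"
	pcsResourceDebugStartEnvVars = "EXAMPLE_VAR=1"
)

// formatPcsCommandString prefixes a pcs command with optional environment
// variables. Hypothetical body: the real helper may behave differently.
func formatPcsCommandString(command, envVars string) string {
	if strings.TrimSpace(envVars) == "" {
		return command
	}
	return fmt.Sprintf("%s %s", envVars, command)
}

// chainCommands joins shell commands so each one runs only if the previous
// one succeeded. Hypothetical helper introduced for this sketch.
func chainCommands(commands ...string) string {
	return strings.Join(commands, " && ")
}

func main() {
	stop := formatPcsCommandString(pcsResourceDebugStop, "")
	start := formatPcsCommandString(pcsResourceDebugStart, pcsResourceDebugStartEnvVars)
	fmt.Println(chainCommands(stop, start))
}
```

Routing both commands through the same formatter keeps the call site symmetric, which is the readability point raised above.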


// ExecuteRemoteSSHCommand executes a command on an OpenShift node via two-hop SSH (local → hypervisor → node).
// Uses 'core' user for the node connection.
func ExecuteRemoteSSHCommand(nodeIP, command string, sshConfig *SSHConfig, localKnownHostsPath, remoteKnownHostsPath string) (string, string, error) {

I would rename nodeIP to remoteNodeIP and sshConfig to hypervisorSSHConfig to make them more explicit. That way it is easier to tell what information each parameter provides to the function.
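
A sketch of how the renamed parameters read at each hop. This is not the PR's code: the `SSHConfig` fields, the `shellQuote`, `buildNodeHopCommand`, and `buildTwoHopSSHArgs` helpers, and the choice of ssh options are all assumptions for this example. Only the 'core' user and the local → hypervisor → node flow come from the function comment.

```go
package main

import (
	"fmt"
	"strings"
)

// SSHConfig is a hypothetical shape; the PR's struct fields are not shown here.
type SSHConfig struct {
	IP             string
	User           string
	PrivateKeyPath string
}

// shellQuote wraps s in single quotes so the remote shell treats it as one word.
func shellQuote(s string) string {
	return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
}

// buildNodeHopCommand returns the command the hypervisor runs to reach the
// node as the 'core' user (second hop).
func buildNodeHopCommand(remoteNodeIP, command, remoteKnownHostsPath string) string {
	return fmt.Sprintf("ssh -o UserKnownHostsFile=%s core@%s %s",
		shellQuote(remoteKnownHostsPath), remoteNodeIP, shellQuote(command))
}

// buildTwoHopSSHArgs returns the arguments for the local ssh invocation
// (first hop), which carries the second-hop command as its payload.
func buildTwoHopSSHArgs(remoteNodeIP, command string, hypervisorSSHConfig *SSHConfig,
	localKnownHostsPath, remoteKnownHostsPath string) []string {
	return []string{
		"-i", hypervisorSSHConfig.PrivateKeyPath,
		"-o", "UserKnownHostsFile=" + localKnownHostsPath,
		hypervisorSSHConfig.User + "@" + hypervisorSSHConfig.IP,
		buildNodeHopCommand(remoteNodeIP, command, remoteKnownHostsPath),
	}
}

func main() {
	hypervisor := &SSHConfig{IP: "198.51.100.5", User: "root", PrivateKeyPath: "/tmp/id_rsa"}
	args := buildTwoHopSSHArgs("192.0.2.10", "sudo pcs status", hypervisor,
		"/tmp/local_known_hosts", "/tmp/remote_known_hosts")
	fmt.Println("ssh", strings.Join(args, " "))
}
```

With the explicit names, it is clear at a glance that hypervisorSSHConfig belongs to the first hop and remoteNodeIP to the second.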

@jaypoulz jaypoulz force-pushed the two-node-disruption-test-libs branch 5 times, most recently from 323411e to ef0ff06 Compare October 3, 2025 15:38

openshift-trt bot commented Oct 3, 2025

Job Failure Risk Analysis for sha: ef0ff06

| Job Name | Failure Risk |
| --- | --- |
| pull-ci-openshift-origin-main-e2e-aws-ovn-single-node-upgrade | IncompleteTests. Tests for this run (32) are below the historical average (3677): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems) |
| pull-ci-openshift-origin-main-okd-scos-e2e-aws-ovn | IncompleteTests. Tests for this run (140) are below the historical average (1532): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems) |

@jaypoulz jaypoulz force-pushed the two-node-disruption-test-libs branch from ef0ff06 to 74d8539 Compare October 8, 2025 00:33

openshift-trt bot commented Oct 8, 2025

Job Failure Risk Analysis for sha: 74d8539

| Job Name | Failure Risk |
| --- | --- |
| pull-ci-openshift-origin-main-e2e-aws-ovn-single-node-upgrade | IncompleteTests |

@jaypoulz jaypoulz force-pushed the two-node-disruption-test-libs branch from 74d8539 to a97ec55 Compare October 8, 2025 13:12
This commit introduces infrastructure for testing two-node OpenShift
cluster disruption scenarios via hypervisor operations, enabling tests
that replace or recover control plane nodes through VM management.

New utility libraries (test/extended/two_node/utils/):

Core utilities:
- file.go: Temp file creation, template processing, resource backup
- retry.go: Configurable retry logic and polling with timeouts
- ssh.go: Direct and two-hop SSH (local→hypervisor→node)
- validation.go: Input validation and security checks
- etcd.go: Error classification, job management, polling
- hypervisor.go: Connectivity verification and config helpers
- libvirt.go: VM lifecycle via virsh (define/start/stop/destroy)
- pacemaker.go: Cluster operations (node add/remove, status)

Framework integration:
- Added HypervisorConfig to cluster discovery
- New --with-hypervisor-json flag for SSH configuration
- Added [Requires:HypervisorSSHConfig] skip annotation
- Test helpers: GetHypervisorConfig(), HasHypervisorConfig()

Key features: Two-hop SSH support, VM management, etcd/Pacemaker
control, security-focused validation, intelligent retry logic.
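
As an illustration of the "configurable retry logic" that retry.go is described as providing, here is a minimal sketch. The names `RetryOptions`, `RetryWithOptions`, and `demoRetry` are hypothetical and are not taken from the PR; the actual library may expose a different API.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// RetryOptions is a hypothetical configuration type for this sketch.
type RetryOptions struct {
	MaxAttempts int           // total number of tries, must be >= 1
	Interval    time.Duration // wait between consecutive tries
}

// RetryWithOptions calls operation until it succeeds, attempts run out,
// or ctx is cancelled while waiting between attempts.
func RetryWithOptions(ctx context.Context, opts RetryOptions, operation func() error) error {
	if opts.MaxAttempts < 1 {
		return errors.New("MaxAttempts must be at least 1")
	}
	var lastErr error
	for attempt := 1; attempt <= opts.MaxAttempts; attempt++ {
		if lastErr = operation(); lastErr == nil {
			return nil
		}
		if attempt == opts.MaxAttempts {
			break // no point waiting after the final attempt
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("retry cancelled after %d attempts: %w", attempt, ctx.Err())
		case <-time.After(opts.Interval):
		}
	}
	return fmt.Errorf("operation failed after %d attempts: %w", opts.MaxAttempts, lastErr)
}

// demoRetry runs an operation that fails twice before succeeding and
// returns how many times it was called along with the final error.
func demoRetry() (int, error) {
	calls := 0
	err := RetryWithOptions(context.Background(),
		RetryOptions{MaxAttempts: 3, Interval: 10 * time.Millisecond},
		func() error {
			calls++
			if calls < 3 {
				return errors.New("not ready yet")
			}
			return nil
		})
	return calls, err
}

func main() {
	calls, err := demoRetry()
	fmt.Println(calls, err)
}
```

Accepting a context lets a test suite abort long polling loops when the overall test timeout expires, which matters for slow hypervisor operations.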
@jaypoulz jaypoulz force-pushed the two-node-disruption-test-libs branch from a97ec55 to 64d4f0e Compare October 8, 2025 14:56
Contributor

openshift-ci bot commented Oct 8, 2025

@jaypoulz: The following tests failed. Say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-ovn-single-node-upgrade | 74d8539 | link | false | /test e2e-aws-ovn-single-node-upgrade |
| ci/prow/e2e-gcp-ovn-upgrade | 64d4f0e | link | true | /test e2e-gcp-ovn-upgrade |
| ci/prow/e2e-aws-ovn-single-node | 74d8539 | link | false | /test e2e-aws-ovn-single-node |
| ci/prow/e2e-openstack-ovn | 74d8539 | link | false | /test e2e-openstack-ovn |
| ci/prow/e2e-aws-ovn-single-node-serial | 74d8539 | link | false | /test e2e-aws-ovn-single-node-serial |
| ci/prow/okd-scos-e2e-aws-ovn | 64d4f0e | link | false | /test okd-scos-e2e-aws-ovn |
| ci/prow/e2e-gcp-ovn | 64d4f0e | link | true | /test e2e-gcp-ovn |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels

do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress.

2 participants