Conversation

hongkailiu
Member

The alert ClusterOperatorDegraded is fatal because it indicates that either

  • the cluster version operator is working (reporting Available=True) but sad (not reporting Failing=False), or
  • some cluster operator is sad (not reporting Degraded=False).

We do not want either of those to happen in the CI upgrade tests.

When a test fails due to the alert, we can check the ClusterVersion manifest, usually collected as an artifact of the test, to figure out the reason that triggered the alert.

If it was caused by some cluster operator (CO), we should file a bug for the component that the CO belongs to.
Otherwise, it is a bug for the CVO.

[1]. https://github.com/openshift/cluster-version-operator/blob/d81848ae818b277bdaffa5375ad6366111c4143c/install/0000_90_cluster-version-operator_02_servicemonitor.yaml#L106
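When triaging such a failure, the relevant reasons can be read straight out of the collected ClusterVersion manifest. A minimal sketch with `jq`, assuming the artifact was gathered as JSON; the file path and the inline sample conditions are hypothetical:

```shell
# Hypothetical ClusterVersion artifact, as a gather step might collect it.
cat > /tmp/clusterversion.json <<'EOF'
{"status":{"conditions":[
  {"type":"Available","status":"True","reason":"AsExpected"},
  {"type":"Failing","status":"True","reason":"ClusterOperatorDegraded",
   "message":"Cluster operator olm is degraded"},
  {"type":"Progressing","status":"False","reason":"AsExpected"}
]}}
EOF

# Print type/status/reason for each condition. Failing=True with a
# message naming a cluster operator points at that CO's component;
# otherwise suspect the CVO itself.
jq -r '.status.conditions[] | [.type, .status, .reason] | @tsv' /tmp/clusterversion.json
```

In this sample the message on the Failing condition names the degraded CO, so the bug would go to that component.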

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Sep 19, 2025
@openshift-ci-robot

openshift-ci-robot commented Sep 19, 2025

@hongkailiu: This pull request references OTA-1626 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

(The PR description, quoted above.)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Member

@wking wking left a comment


Looking good in CI, where we currently report the alert firing as a flake (this pull will move it to fatal):

$ w3m -dump -cols 200 'https://search.dptools.openshift.org/?maxAge=24h&type=junit&search=ClusterOperatorDegraded.*firing' | grep 'failures match'
periodic-ci-openshift-release-master-okd-scos-4.17-e2e-gcp-ovn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
pull-ci-openshift-api-master-minor-e2e-upgrade-minor (all) - 14 runs, 43% failed, 83% of failures match = 36% impact
pull-ci-openshift-cluster-monitoring-operator-main-e2e-aws-ovn-upgrade (all) - 8 runs, 25% failed, 50% of failures match = 13% impact
pull-ci-openshift-cluster-monitoring-operator-release-4.19-e2e-aws-ovn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-etcd-certrotation (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.21-e2e-gcp-ovn-upgrade-out-of-change (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.21-e2e-gcp-cilium (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.21-e2e-azure-ovn-upgrade-out-of-change (all) - 4 runs, 25% failed, 100% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.21-e2e-aws-ovn-upgrade-out-of-change (all) - 4 runs, 50% failed, 50% of failures match = 25% impact
rehearse-69078-periodic-ci-openshift-release-master-ci-4.21-e2e-azure-ovn-upgrade-out-of-change (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
$ curl -s 'https://search.dptools.openshift.org/search?maxAge=24h&type=junit&context=0&search=ClusterOperatorDegraded.*firing' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | grep -o 'labels:.*' | sort | uniq -c | sort -n
      1 labels: alertstate/firing severity/warning ALERTS{alertname="ClusterOperatorDegraded", alertstate="firing", name="etcd", namespace="openshift-cluster-version", prometheus="openshift-monitoring/k8s", reason="ClusterMemberController_SyncError::ClusterMemberRemovalController_SyncError::EtcdEndpoints_ErrorUpdatingEtcdEndpoints::EtcdMembersController_ErrorUpdatingReportEtcdMembers::RevisionController_SyncError", severity="warning"} result=reject
      1 labels: alertstate/firing severity/warning ALERTS{alertname="ClusterOperatorDegraded", alertstate="firing", name="ingress", namespace="openshift-cluster-version", prometheus="openshift-monitoring/k8s", reason="IngressDegraded", severity="warning"} result=reject
      1 labels: alertstate/firing severity/warning ALERTS{alertname="ClusterOperatorDegraded", alertstate="firing", name="version", namespace="openshift-cluster-version", prometheus="openshift-monitoring/k8s", reason="ClusterOperatorUpdating", severity="warning"} result=reject
      2 labels: alertstate/firing severity/warning ALERTS{alertname="ClusterOperatorDegraded", alertstate="firing", name="version", namespace="openshift-cluster-version", prometheus="openshift-monitoring/k8s", reason="ClusterOperatorNotAvailable", severity="warning"} result=reject
      4 labels: alertstate/firing severity/warning ALERTS{alertname="ClusterOperatorDegraded", alertstate="firing", name="olm", namespace="openshift-cluster-version", prometheus="openshift-monitoring/k8s", reason="OperatorControllerStaticResources_SyncError", severity="warning"} result=reject
      5 labels: alertstate/firing severity/warning ALERTS{alertname="ClusterOperatorDegraded", alertstate="firing", name="version", namespace="openshift-cluster-version", prometheus="openshift-monitoring/k8s", severity="warning"} result=reject
$ curl -s 'https://search.dptools.openshift.org/search?maxAge=24h&type=junit&context=0&search=ClusterOperatorDegraded.*firing.*olm' | jq -r 'keys[]'
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.21-e2e-aws-ovn-upgrade-out-of-change/1968761457113829376
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.21-e2e-azure-ovn-upgrade-out-of-change/1968761542681825280
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.21-e2e-gcp-ovn-upgrade-out-of-change/1968853188824010752
https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/69078/rehearse-69078-periodic-ci-openshift-release-master-ci-4.21-e2e-azure-ovn-upgrade-out-of-change/1968748785328721920

So pretty good success rates (with no flakes at all) in most jobs. A few jobs have flakes (which this would make fatal). Most of those hits are ...upgrade-out-of-change jobs on the tail end of the issue that openshift/cluster-olm-operator#139 recovered from.

/verified by CI Search
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 19, 2025
@hongkailiu
Member Author

/retest-requied

@hongkailiu
Member Author

/retest-required

@petr-muller
Member

/cc

@openshift-ci openshift-ci bot requested a review from petr-muller September 22, 2025 11:31
@sosiouxme
Member

/approve

Contributor

openshift-ci bot commented Sep 22, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hongkailiu, sosiouxme, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 22, 2025
@hongkailiu
Member Author

/verified by "CI Search"

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Sep 22, 2025
@openshift-ci-robot

@hongkailiu: This PR has been marked as verified by "CI Search".

In response to this:

/verified by "CI Search"


@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 92f0f58 and 2 for PR HEAD 57d02a2 in total

@hongkailiu
Member Author

/test e2e-aws-ovn-fips

1 similar comment
@hongkailiu
Member Author

/test e2e-aws-ovn-fips

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD a08d2f3 and 1 for PR HEAD 57d02a2 in total

@hongkailiu
Member Author

/retest-required

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 6d8c2d0 and 0 for PR HEAD 57d02a2 in total

@openshift-ci-robot

/hold

Revision 57d02a2 was retested 3 times: holding

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 24, 2025
Contributor

openshift-ci bot commented Sep 24, 2025

@hongkailiu: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-metal-ipi-ovn-dualstack 57d02a2 link false /test e2e-metal-ipi-ovn-dualstack
ci/prow/e2e-metal-ipi-virtualmedia 57d02a2 link false /test e2e-metal-ipi-virtualmedia
ci/prow/e2e-aws-ovn-single-node-serial 57d02a2 link false /test e2e-aws-ovn-single-node-serial
ci/prow/e2e-aws-ovn-single-node-upgrade 57d02a2 link false /test e2e-aws-ovn-single-node-upgrade
ci/prow/e2e-azure 57d02a2 link false /test e2e-azure
ci/prow/e2e-aws-ovn-single-node 57d02a2 link false /test e2e-aws-ovn-single-node
ci/prow/e2e-gcp-ovn-techpreview-serial-2of2 57d02a2 link false /test e2e-gcp-ovn-techpreview-serial-2of2
ci/prow/e2e-openstack-ovn 57d02a2 link false /test e2e-openstack-ovn
ci/prow/e2e-aws-disruptive 57d02a2 link false /test e2e-aws-disruptive
ci/prow/e2e-aws-ovn-serial-1of2 57d02a2 link true /test e2e-aws-ovn-serial-1of2
ci/prow/e2e-aws-ovn-serial-2of2 57d02a2 link true /test e2e-aws-ovn-serial-2of2

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


openshift-trt bot commented Sep 24, 2025

Job Failure Risk Analysis for sha: 57d02a2

Job Name Failure Risk
pull-ci-openshift-origin-main-e2e-aws-ovn-serial-1of2 IncompleteTests
Tests for this run (25) are below the historical average (1624): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-main-e2e-aws-ovn-serial-2of2 IncompleteTests
Tests for this run (25) are below the historical average (1603): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)

Risk analysis has seen new tests most likely introduced by this PR.
Please ensure that new tests meet guidelines for naming and stability.

New Test Risks for sha: 57d02a2

Job Name New Test Risk
pull-ci-openshift-origin-main-e2e-aws-ovn-fips High - "[Monitor:legacy-test-framework-invariants][bz-Cluster Version Operator][invariant] alert/ClusterOperatorDegraded should not be at or above info" is a new test that was not present in all runs against the current commit.
pull-ci-openshift-origin-main-e2e-aws-ovn-serial-1of2 High - "[Monitor:legacy-test-framework-invariants][bz-Cluster Version Operator][invariant] alert/ClusterOperatorDegraded should not be at or above info" is a new test that was not present in all runs against the current commit.
pull-ci-openshift-origin-main-e2e-aws-ovn-serial-2of2 High - "[Monitor:legacy-test-framework-invariants][bz-Cluster Version Operator][invariant] alert/ClusterOperatorDegraded should not be at or above info" is a new test that was not present in all runs against the current commit.
pull-ci-openshift-origin-main-e2e-metal-ipi-ovn-ipv6 High - "[Monitor:legacy-test-framework-invariants][bz-Cluster Version Operator][invariant] alert/ClusterOperatorDegraded should not be at or above info" is a new test that was not present in all runs against the current commit.
pull-ci-openshift-origin-main-e2e-vsphere-ovn High - "[Monitor:legacy-test-framework-invariants][bz-Cluster Version Operator][invariant] alert/ClusterOperatorDegraded should not be at or above info" is a new test that was not present in all runs against the current commit.
pull-ci-openshift-origin-main-e2e-vsphere-ovn-upi High - "[Monitor:legacy-test-framework-invariants][bz-Cluster Version Operator][invariant] alert/ClusterOperatorDegraded should not be at or above info" is a new test that was not present in all runs against the current commit.

New tests seen in this PR at sha: 57d02a2

  • "[Monitor:legacy-test-framework-invariants][bz-Cluster Version Operator][invariant] alert/ClusterOperatorDegraded should not be at or above info" [Total: 59, Pass: 59, Fail: 0, Flake: 0]
