OTA-1626: Fail CI if alert/ClusterOperatorDegraded is fired #30282
base: main
Conversation
The alert ClusterOperatorDegraded is fatal because it indicates
- the cluster version operator is working (reporting Available=True) but sad (not reporting Failing=False), or
- some cluster operator is sad (not reporting Degraded=False).

We do not want either of those to happen in the CI upgrade tests. When a test fails due to the alert, we can check the ClusterVersion manifest, usually collected as an artifact of the test, to figure out the reason that triggered the alert. If it was caused by some CO, we should file a bug for the component that CO belongs to. Otherwise, it is a bug for the CVO.

[1]. https://github.com/openshift/cluster-version-operator/blob/d81848ae818b277bdaffa5375ad6366111c4143c/install/0000_90_cluster-version-operator_02_servicemonitor.yaml#L106
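As a triage aid, a minimal sketch of inspecting the gathered manifests with jq (the file names here are illustrative, not the exact CI artifact layout):

$ jq -r '.status.conditions[] | select(.type=="Failing") | "\(.type)=\(.status): \(.reason): \(.message)"' clusterversion.json
$ jq -r '.items[] | select(any(.status.conditions[]?; .type=="Degraded" and .status=="True")) | .metadata.name' clusteroperators.json

The first command shows why the CVO considers itself sad; the second lists any cluster operators still reporting Degraded=True.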
@hongkailiu: This pull request references OTA-1626 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Looking good in CI, where we currently report on the alert firing, but only as a flake (which this pull will move to fatal):
$ w3m -dump -cols 200 'https://search.dptools.openshift.org/?maxAge=24h&type=junit&search=ClusterOperatorDegraded.*firing' | grep 'failures match'
periodic-ci-openshift-release-master-okd-scos-4.17-e2e-gcp-ovn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
pull-ci-openshift-api-master-minor-e2e-upgrade-minor (all) - 14 runs, 43% failed, 83% of failures match = 36% impact
pull-ci-openshift-cluster-monitoring-operator-main-e2e-aws-ovn-upgrade (all) - 8 runs, 25% failed, 50% of failures match = 13% impact
pull-ci-openshift-cluster-monitoring-operator-release-4.19-e2e-aws-ovn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-etcd-certrotation (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.21-e2e-gcp-ovn-upgrade-out-of-change (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.21-e2e-gcp-cilium (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.21-e2e-azure-ovn-upgrade-out-of-change (all) - 4 runs, 25% failed, 100% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.21-e2e-aws-ovn-upgrade-out-of-change (all) - 4 runs, 50% failed, 50% of failures match = 25% impact
rehearse-69078-periodic-ci-openshift-release-master-ci-4.21-e2e-azure-ovn-upgrade-out-of-change (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
$ curl -s 'https://search.dptools.openshift.org/search?maxAge=24h&type=junit&context=0&search=ClusterOperatorDegraded.*firing' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | grep -o 'labels:.*' | sort | uniq -c | sort -n
1 labels: alertstate/firing severity/warning ALERTS{alertname="ClusterOperatorDegraded", alertstate="firing", name="etcd", namespace="openshift-cluster-version", prometheus="openshift-monitoring/k8s", reason="ClusterMemberController_SyncError::ClusterMemberRemovalController_SyncError::EtcdEndpoints_ErrorUpdatingEtcdEndpoints::EtcdMembersController_ErrorUpdatingReportEtcdMembers::RevisionController_SyncError", severity="warning"} result=reject
1 labels: alertstate/firing severity/warning ALERTS{alertname="ClusterOperatorDegraded", alertstate="firing", name="ingress", namespace="openshift-cluster-version", prometheus="openshift-monitoring/k8s", reason="IngressDegraded", severity="warning"} result=reject
1 labels: alertstate/firing severity/warning ALERTS{alertname="ClusterOperatorDegraded", alertstate="firing", name="version", namespace="openshift-cluster-version", prometheus="openshift-monitoring/k8s", reason="ClusterOperatorUpdating", severity="warning"} result=reject
2 labels: alertstate/firing severity/warning ALERTS{alertname="ClusterOperatorDegraded", alertstate="firing", name="version", namespace="openshift-cluster-version", prometheus="openshift-monitoring/k8s", reason="ClusterOperatorNotAvailable", severity="warning"} result=reject
4 labels: alertstate/firing severity/warning ALERTS{alertname="ClusterOperatorDegraded", alertstate="firing", name="olm", namespace="openshift-cluster-version", prometheus="openshift-monitoring/k8s", reason="OperatorControllerStaticResources_SyncError", severity="warning"} result=reject
5 labels: alertstate/firing severity/warning ALERTS{alertname="ClusterOperatorDegraded", alertstate="firing", name="version", namespace="openshift-cluster-version", prometheus="openshift-monitoring/k8s", severity="warning"} result=reject
$ curl -s 'https://search.dptools.openshift.org/search?maxAge=24h&type=junit&context=0&search=ClusterOperatorDegraded.*firing.*olm' | jq -r 'keys[]'
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.21-e2e-aws-ovn-upgrade-out-of-change/1968761457113829376
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.21-e2e-azure-ovn-upgrade-out-of-change/1968761542681825280
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.21-e2e-gcp-ovn-upgrade-out-of-change/1968853188824010752
https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/69078/rehearse-69078-periodic-ci-openshift-release-master-ci-4.21-e2e-azure-ovn-upgrade-out-of-change/1968748785328721920
So pretty good success rates (with no flakes at all) in most jobs, and a few jobs that have flakes (which this would make fatal). Most of those hits are ...upgrade-out-of-change jobs on the tail end of the issue that openshift/cluster-olm-operator#139 recovered from.
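As a rough cross-check (just a narrowed variant of the search above, to eyeball how concentrated the hits are in those jobs):

$ w3m -dump -cols 200 'https://search.dptools.openshift.org/?maxAge=24h&type=junit&search=ClusterOperatorDegraded.*firing' | grep 'upgrade-out-of-change.*failures match'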
/verified by CI Search
/lgtm
/retest-requied
/retest-required
/cc
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: hongkailiu, sosiouxme, wking
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.
/verified by "CI Search"
@hongkailiu: This PR has been marked as verified by "CI Search". In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/test e2e-aws-ovn-fips
1 similar comment
/test e2e-aws-ovn-fips
/retest-required
/hold
Revision 57d02a2 was retested 3 times: holding
@hongkailiu: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Job Failure Risk Analysis for sha: 57d02a2
Risk analysis has seen new tests most likely introduced by this PR.
New Test Risks for sha: 57d02a2
New tests seen in this PR at sha: 57d02a2