OTA-1626: Fail CI if alert/ClusterOperatorDegraded is fired #30282
base: main
Conversation
The alert ClusterOperatorDegraded is fatal because it indicates
- the cluster version operator is working (reporting Available=True) but sad (not reporting Failing=False), or
- some cluster operator is sad (not reporting Degraded=False).

We do not want either of those to happen in the CI upgrade tests. When a test fails due to the alert, we can check the ClusterVersion manifest, usually collected as an artifact of the test, to figure out the reason that triggered the alert. If it was caused by some CO, we should file a bug for the component that CO belongs to. Otherwise, it is a bug for the CVO.

[1]. https://github.com/openshift/cluster-version-operator/blob/d81848ae818b277bdaffa5375ad6366111c4143c/install/0000_90_cluster-version-operator_02_servicemonitor.yaml#L106
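As a triage aid, a minimal sketch of inspecting the gathered manifests with jq (the file names here are illustrative, not the exact CI artifact layout):

$ jq -r '.status.conditions[] | select(.type=="Failing") | "\(.type)=\(.status): \(.reason): \(.message)"' clusterversion.json
$ jq -r '.items[] | select(any(.status.conditions[]?; .type=="Degraded" and .status=="True")) | .metadata.name' clusteroperators.json

The first command shows why the CVO considers itself sad; the second lists any cluster operators still reporting Degraded=True.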
@hongkailiu: This pull request references OTA-1626 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Looking good in CI, where we currently report on the alert firing, but only as a flake (which this pull will move to fatal):
$ w3m -dump -cols 200 'https://search.dptools.openshift.org/?maxAge=24h&type=junit&search=ClusterOperatorDegraded.*firing' | grep 'failures match'
periodic-ci-openshift-release-master-okd-scos-4.17-e2e-gcp-ovn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
pull-ci-openshift-api-master-minor-e2e-upgrade-minor (all) - 14 runs, 43% failed, 83% of failures match = 36% impact
pull-ci-openshift-cluster-monitoring-operator-main-e2e-aws-ovn-upgrade (all) - 8 runs, 25% failed, 50% of failures match = 13% impact
pull-ci-openshift-cluster-monitoring-operator-release-4.19-e2e-aws-ovn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-etcd-certrotation (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.21-e2e-gcp-ovn-upgrade-out-of-change (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.21-e2e-gcp-cilium (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.21-e2e-azure-ovn-upgrade-out-of-change (all) - 4 runs, 25% failed, 100% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.21-e2e-aws-ovn-upgrade-out-of-change (all) - 4 runs, 50% failed, 50% of failures match = 25% impact
rehearse-69078-periodic-ci-openshift-release-master-ci-4.21-e2e-azure-ovn-upgrade-out-of-change (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
$ curl -s 'https://search.dptools.openshift.org/search?maxAge=24h&type=junit&context=0&search=ClusterOperatorDegraded.*firing' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | grep -o 'labels:.*' | sort | uniq -c | sort -n
1 labels: alertstate/firing severity/warning ALERTS{alertname="ClusterOperatorDegraded", alertstate="firing", name="etcd", namespace="openshift-cluster-version", prometheus="openshift-monitoring/k8s", reason="ClusterMemberController_SyncError::ClusterMemberRemovalController_SyncError::EtcdEndpoints_ErrorUpdatingEtcdEndpoints::EtcdMembersController_ErrorUpdatingReportEtcdMembers::RevisionController_SyncError", severity="warning"} result=reject
1 labels: alertstate/firing severity/warning ALERTS{alertname="ClusterOperatorDegraded", alertstate="firing", name="ingress", namespace="openshift-cluster-version", prometheus="openshift-monitoring/k8s", reason="IngressDegraded", severity="warning"} result=reject
1 labels: alertstate/firing severity/warning ALERTS{alertname="ClusterOperatorDegraded", alertstate="firing", name="version", namespace="openshift-cluster-version", prometheus="openshift-monitoring/k8s", reason="ClusterOperatorUpdating", severity="warning"} result=reject
2 labels: alertstate/firing severity/warning ALERTS{alertname="ClusterOperatorDegraded", alertstate="firing", name="version", namespace="openshift-cluster-version", prometheus="openshift-monitoring/k8s", reason="ClusterOperatorNotAvailable", severity="warning"} result=reject
4 labels: alertstate/firing severity/warning ALERTS{alertname="ClusterOperatorDegraded", alertstate="firing", name="olm", namespace="openshift-cluster-version", prometheus="openshift-monitoring/k8s", reason="OperatorControllerStaticResources_SyncError", severity="warning"} result=reject
5 labels: alertstate/firing severity/warning ALERTS{alertname="ClusterOperatorDegraded", alertstate="firing", name="version", namespace="openshift-cluster-version", prometheus="openshift-monitoring/k8s", severity="warning"} result=reject
$ curl -s 'https://search.dptools.openshift.org/search?maxAge=24h&type=junit&context=0&search=ClusterOperatorDegraded.*firing.*olm' | jq -r 'keys[]'
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.21-e2e-aws-ovn-upgrade-out-of-change/1968761457113829376
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.21-e2e-azure-ovn-upgrade-out-of-change/1968761542681825280
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.21-e2e-gcp-ovn-upgrade-out-of-change/1968853188824010752
https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/69078/rehearse-69078-periodic-ci-openshift-release-master-ci-4.21-e2e-azure-ovn-upgrade-out-of-change/1968748785328721920
So pretty good success rates (with no flakes at all) in most jobs, and a few jobs that have flakes (which this would make fatal). Most of those hits are ...upgrade-out-of-change jobs on the tail end of the issue that openshift/cluster-olm-operator#139 recovered from.
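As a rough cross-check (just a narrowed variant of the search above, to eyeball how concentrated the hits are in those jobs):

$ w3m -dump -cols 200 'https://search.dptools.openshift.org/?maxAge=24h&type=junit&search=ClusterOperatorDegraded.*firing' | grep 'upgrade-out-of-change.*failures match'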
/verified by CI Search
/lgtm
/retest-requied
/retest-required
/cc
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: hongkailiu, sosiouxme, wking
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.
/verified by "CI Search"
@hongkailiu: This PR has been marked as verified by "CI Search". In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/test e2e-aws-ovn-fips
1 similar comment
/test e2e-aws-ovn-fips
/retest-required
/hold
Revision 57d02a2 was retested 3 times: holding
@hongkailiu: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Job Failure Risk Analysis for sha: 57d02a2
Risk analysis has seen new tests most likely introduced by this PR.
New Test Risks for sha: 57d02a2
New tests seen in this PR at sha: 57d02a2