OCPBUGS-55238: spyglass: hide disruption events for localhost #29710

Open: 2 commits to be merged into base branch main

Conversation

vrutkovs
Member

Don't display localhost-related disruptions on spyglass. They are still displayed on non-spyglass reports in case an unexpected localhost disruption happens.

@openshift-ci-robot added labels on Apr 24, 2025: jira/severity-important (Referenced Jira bug's severity is important for the branch this PR is targeting.) and jira/valid-reference (Indicates that this PR references a valid Jira ticket of any type.)
@openshift-ci-robot

@vrutkovs: This pull request references Jira Issue OCPBUGS-55238, which is invalid:

  • expected the bug to target the "4.19.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Don't display localhost-related disruptions on spyglass. These are still displayed on non-spyglass reports in case unexpected localhost disruption happens

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot added label on Apr 24, 2025: jira/invalid-bug (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.)
@vrutkovs
Member Author

/jira refresh

@openshift-ci-robot added label jira/valid-bug (Indicates that a referenced Jira bug is valid for the branch this PR is targeting.) and removed label jira/invalid-bug (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.) on Apr 24, 2025
@openshift-ci-robot

@vrutkovs: This pull request references Jira Issue OCPBUGS-55238, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.19.0) matches configured target version for branch (4.19.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @wangke19

In response to this:

/jira refresh


openshift-trt bot commented May 2, 2025

Risk analysis has seen new tests most likely introduced by this PR.
Please ensure that new tests meet guidelines for naming and stability.

New Test Risks for sha: c900caa

Job Name | New Test Risk
--- | ---
pull-ci-openshift-origin-main-e2e-aws-ovn-serial | Medium - "Find all of the input images from ocp/4.20 and tag them into the stable stream" is a new test, and was only seen in one job.
pull-ci-openshift-origin-main-e2e-aws-ovn-serial | Medium - "Find all of the input images from ocp/4.20 and tag them into the stable-initial stream" is a new test, and was only seen in one job.

New tests seen in this PR at sha: c900caa

  • "Find all of the input images from ocp/4.20 and tag them into the stable stream" [Total: 1, Pass: 1, Fail: 0, Flake: 0]
  • "Find all of the input images from ocp/4.20 and tag them into the stable-initial stream" [Total: 1, Pass: 1, Fail: 0, Flake: 0]

@dgoodwin
Contributor

The problem with leaving expected disruption in the data and only hiding it in the UI is that the larger system used to monitor disruption data needs the same accommodations everywhere; otherwise it flags localhost disruption as real disruption and starts monitoring it for changes. That system includes the Grafana dashboard, the alerts in the dpcr cluster, the metrics published by sippy for those alerts, and the scheduled queries in BigQuery used for reporting.

Do you intend to have this monitored for changes in disruption and pursue fixes for those issues?

If so, then maybe we leave it in (but we wouldn't want to hide it on interval charts).

If not, these intervals really should be classified with a different source. That would immediately remove them from the analysis framework, and they would not appear in this chart.

Also remember the new intervals UI under debug tools is at https://github.com/openshift/sippy/blob/main/sippy-ng/src/prow_job_runs/IntervalsChart.js and it is largely based on categorizing by Source.
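
The effect of classifying by source can be illustrated with a minimal Python sketch. It assumes only that downstream consumers select intervals by their source. The `Interval` type, the source names `Disruption` and `LocalhostDisruption`, the backend names, and both helper functions are hypothetical and are not taken from openshift/origin or sippy.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical source names; the real identifiers may differ.
SOURCE_DISRUPTION = "Disruption"
SOURCE_LOCALHOST_DISRUPTION = "LocalhostDisruption"


@dataclass(frozen=True)
class Interval:
    source: str     # category a chart or query groups by
    backend: str    # illustrative backend name
    seconds: float  # duration of the observed outage


def group_by_source(intervals):
    """Bucket intervals by source, the way a source-based chart would."""
    grouped = defaultdict(list)
    for interval in intervals:
        grouped[interval.source].append(interval)
    return dict(grouped)


def analyzed_disruption_seconds(intervals):
    """Sum only the intervals whose source feeds the analysis pipeline."""
    grouped = group_by_source(intervals)
    return sum(i.seconds for i in grouped.get(SOURCE_DISRUPTION, []))


intervals = [
    Interval(SOURCE_DISRUPTION, "api-external", 3.0),
    Interval(SOURCE_LOCALHOST_DISRUPTION, "api-localhost", 12.0),
]
```

Under this assumption the localhost interval is still recorded, but it sits in its own bucket and never reaches a consumer that reads only the `Disruption` source.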

@vrutkovs
Member Author

> Do you intend to have this monitored for changes in disruption and pursue fixes for those issues?

Localhost disruptions are expected when a pod restarts (on rollout), but they may be misleading: in most cases they are expected to happen.

> If so, then maybe we leave it in (but we wouldn't want to hide it on interval charts).

We're hiding them on the main chart, but leaving them on non-spyglass charts for completeness.

> If not, these intervals really should be classified with a different source. That would immediately remove them from the analysis framework, and they would not appear in this chart.

I don't think these are being sent for analysis anyway.

@dgoodwin
Contributor

They have been spamming #trt-alerts for weeks now, up to and including today, so they are definitely going into the analysis system.

Can you skip generating the intervals when it's expected?
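
One way to picture "skip generating the intervals when it's expected" is the following sketch. It is an assumption-laden illustration, not the approach taken in this PR: the `Window` type, both function names, and the idea that expected rollout windows are known up front are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    start: float  # seconds since the start of the job run
    end: float


def is_expected(outage, expected_windows):
    """True if the outage lies entirely inside one expected window."""
    return any(
        w.start <= outage.start and outage.end <= w.end
        for w in expected_windows
    )


def intervals_to_record(outages, expected_windows):
    """Drop outages fully explained by an expected window; keep the rest."""
    return [o for o in outages if not is_expected(o, expected_windows)]


# Hypothetical data: one rollout window and three observed outages.
rollouts = [Window(100, 160)]
outages = [Window(110, 120), Window(150, 170), Window(300, 305)]
recorded = intervals_to_record(outages, rollouts)
```

In this sketch an outage that only partially overlaps a rollout window is still recorded, which keeps unexpected disruption visible at the cost of some noise around window edges.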

@vrutkovs force-pushed the disruption-exclude-localhost branch from c900caa to c3917f0 on July 22, 2025 08:10
@vrutkovs
Member Author

I think it's easier to move them to a different source.

vrutkovs added 2 commits July 22, 2025 13:08
Localhost disruptions on apiservers are useful to record, but some of them are expected (e.g.
during installer pod rollout). Instead of being hidden entirely, they are recorded under a separate
source and hidden in the spyglass/sippy view. Other views still display them in case they help
find correlations.
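
The behaviour described in this commit message can be sketched roughly as follows, in Python. Every name here is hypothetical: the source strings, the view names, the `HIDDEN_SOURCES` table, and the substring check on the backend name are assumptions made for illustration and are not taken from the actual change.

```python
# Hypothetical source names; the real identifiers may differ.
SOURCE_DISRUPTION = "Disruption"
SOURCE_LOCALHOST_DISRUPTION = "LocalhostDisruption"

# Sources each view hides; a view that is not listed hides nothing.
HIDDEN_SOURCES = {
    "spyglass": {SOURCE_LOCALHOST_DISRUPTION},
    "sippy": {SOURCE_LOCALHOST_DISRUPTION},
}


def source_for(backend_name):
    """Pick the source at the moment the interval is created."""
    if "localhost" in backend_name:
        return SOURCE_LOCALHOST_DISRUPTION
    return SOURCE_DISRUPTION


def visible_sources(view, sources):
    """Return the sources a view displays, preserving input order."""
    hidden = HIDDEN_SOURCES.get(view, set())
    return [s for s in sources if s not in hidden]


all_sources = [SOURCE_DISRUPTION, SOURCE_LOCALHOST_DISRUPTION]
```

The point of deciding the source when the interval is created is that each view needs only a list of sources to hide, rather than its own rules for recognising localhost backends.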
@vrutkovs force-pushed the disruption-exclude-localhost branch from c3917f0 to da1a05c on July 22, 2025 11:08
@dgoodwin
Contributor

This looks great, thank you, just waiting to see the resulting files.

@openshift-ci bot added label on Jul 22, 2025: do-not-merge/hold (Indicates that a PR should not merge because someone has issued a /hold command.)
@openshift-ci bot added label on Jul 22, 2025: lgtm (Indicates that a PR is ready to be merged.)
Contributor

openshift-ci bot commented Jul 22, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgoodwin, vrutkovs

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci bot added label on Jul 22, 2025: approved (Indicates a PR has been approved by an approver from all required OWNERS files.)
@vrutkovs
Member Author

/hold cancel

Yup, looks good

@openshift-ci bot removed label on Jul 22, 2025: do-not-merge/hold (Indicates that a PR should not merge because someone has issued a /hold command.)
@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD af0e85d and 2 for PR HEAD da1a05c in total

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 47eed7a and 1 for PR HEAD da1a05c in total

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD b392d63 and 2 for PR HEAD da1a05c in total

Contributor

openshift-ci bot commented Jul 23, 2025

@vrutkovs: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
--- | --- | --- | --- | ---
ci/prow/4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade-rollback | c900caa | link | false | /test 4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade-rollback
ci/prow/e2e-metal-ipi-ovn-dualstack-bgp-local-gw-techpreview | c900caa | link | false | /test e2e-metal-ipi-ovn-dualstack-bgp-local-gw-techpreview
ci/prow/okd-e2e-gcp | c900caa | link | false | /test okd-e2e-gcp
ci/prow/e2e-gcp-fips-serial | c900caa | link | false | /test e2e-gcp-fips-serial
ci/prow/e2e-metal-ipi-ovn-dualstack-bgp-techpreview | c900caa | link | false | /test e2e-metal-ipi-ovn-dualstack-bgp-techpreview
ci/prow/e2e-metal-ipi-serial | c900caa | link | false | /test e2e-metal-ipi-serial
ci/prow/e2e-metal-ipi-serial-ovn-ipv6 | c900caa | link | false | /test e2e-metal-ipi-serial-ovn-ipv6
ci/prow/e2e-aws-ovn-serial | c900caa | link | true | /test e2e-aws-ovn-serial
ci/prow/e2e-aws-ovn-serial-publicnet | c900caa | link | true | /test e2e-aws-ovn-serial-publicnet
ci/prow/e2e-aws-ovn-kube-apiserver-rollout | da1a05c | link | false | /test e2e-aws-ovn-kube-apiserver-rollout
ci/prow/e2e-gcp-ovn-rt-upgrade | da1a05c | link | false | /test e2e-gcp-ovn-rt-upgrade
ci/prow/e2e-aws-ovn-etcd-scaling | da1a05c | link | false | /test e2e-aws-ovn-etcd-scaling
ci/prow/okd-scos-e2e-aws-ovn | da1a05c | link | false | /test okd-scos-e2e-aws-ovn
ci/prow/e2e-gcp-disruptive | da1a05c | link | false | /test e2e-gcp-disruptive
ci/prow/e2e-gcp-fips-serial-2of2 | da1a05c | link | false | /test e2e-gcp-fips-serial-2of2
ci/prow/e2e-azure-ovn-etcd-scaling | da1a05c | link | false | /test e2e-azure-ovn-etcd-scaling
ci/prow/e2e-openstack-serial | da1a05c | link | false | /test e2e-openstack-serial
ci/prow/e2e-azure-ovn-upgrade | da1a05c | link | false | /test e2e-azure-ovn-upgrade
ci/prow/e2e-gcp-ovn-techpreview | da1a05c | link | false | /test e2e-gcp-ovn-techpreview
ci/prow/e2e-openstack-ovn | da1a05c | link | false | /test e2e-openstack-ovn
ci/prow/e2e-aws-disruptive | da1a05c | link | false | /test e2e-aws-disruptive
ci/prow/e2e-aws-ovn-microshift-serial | da1a05c | link | false | /test e2e-aws-ovn-microshift-serial
ci/prow/e2e-gcp-ovn-etcd-scaling | da1a05c | link | false | /test e2e-gcp-ovn-etcd-scaling
ci/prow/e2e-aws-ovn-microshift | da1a05c | link | false | /test e2e-aws-ovn-microshift
ci/prow/e2e-gcp-fips-serial-1of2 | da1a05c | link | false | /test e2e-gcp-fips-serial-1of2
ci/prow/e2e-gcp-ovn-techpreview-serial-2of2 | da1a05c | link | false | /test e2e-gcp-ovn-techpreview-serial-2of2
ci/prow/e2e-aws-ovn-single-node-upgrade | da1a05c | link | false | /test e2e-aws-ovn-single-node-upgrade
ci/prow/e2e-vsphere-ovn-dualstack-primaryv6 | da1a05c | link | false | /test e2e-vsphere-ovn-dualstack-primaryv6
ci/prow/e2e-vsphere-ovn-etcd-scaling | da1a05c | link | false | /test e2e-vsphere-ovn-etcd-scaling

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


openshift-trt bot commented Jul 23, 2025

Job Failure Risk Analysis for sha: da1a05c

Job Name Failure Risk
pull-ci-openshift-origin-main-e2e-aws-disruptive IncompleteTests
Tests for this run (106) are below the historical average (341): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-main-e2e-gcp-ovn-etcd-scaling Low
[bz-etcd][invariant] alert/etcdMembersDown should not be at or above info
This test has passed 0.00% of 1 runs on release 4.20 [Architecture:amd64 FeatureSet:default Installer:ipi JobTier:rare Network:ovn NetworkStack:ipv4 Owner:eng Platform:gcp SecurityMode:default Topology:ha Upgrade:none] in the last week.

Open Bugs
etcd-scaling jobs failing ~60% of the time
---
[bz-Cloud Compute] clusteroperator/control-plane-machine-set should not change condition/Degraded
This test has passed 0.00% of 1 runs on release 4.20 [Architecture:amd64 FeatureSet:default Installer:ipi JobTier:rare Network:ovn NetworkStack:ipv4 Owner:eng Platform:gcp SecurityMode:default Topology:ha Upgrade:none] in the last week.

Open Bugs
etcd-scaling jobs failing ~60% of the time

Labels
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • jira/severity-important: Referenced Jira bug's severity is important for the branch this PR is targeting.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.
3 participants