Skip to content

Promote OnClusterBuild featuregate to default #2192

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

yuqi-zhang
Copy link
Contributor

Final step to GA On Cluster Layering, after
openshift/machine-config-operator#4756 and #2134

See also updated enhancement: openshift/enhancements#1515

Copy link
Contributor

openshift-ci bot commented Feb 4, 2025

Hello @yuqi-zhang! Some important instructions when contributing to openshift/api:
API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.

@openshift-ci openshift-ci bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Feb 4, 2025
@yuqi-zhang
Copy link
Contributor Author

/hold

Currently for testing

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 5, 2025
@JoelSpeed
Copy link
Contributor

Given the absence of automated testing reporting into component readiness, there is a want to make an exception for this feature and promote within 4.19 based on QE testing alone

Could we please get a write up on this PR detailing the existing testing that QE have been completing, with links to historical pass rates where available

I would also appreciate if we can link out to any trackers for the future automated regression testing that's in the pipeline

@JoelSpeed
Copy link
Contributor

Verify failure in schema checker is an issue with the alpha API, we will override that issue

/test e2e-aws-ovn-hypershift
/test minor-e2e-upgrade-minor

@JoelSpeed
Copy link
Contributor

@yuqi-zhang @cheesesashimi The failures in the minor upgrade look like they are legitimate, can you please investigate?

@yuqi-zhang yuqi-zhang force-pushed the promote-onclusterbuild-featuregate branch from 244a7dc to c939ea4 Compare March 21, 2025 01:22
@yuqi-zhang
Copy link
Contributor Author

yuqi-zhang commented Mar 21, 2025

Rebased on master just in case. It looks like in the failed upgrade, the MCO hasn't started upgrade yet (?) and the operator logs are failing on:

E0320 14:53:14.386733       1 reflector.go:158] "Unhandled Error" err="github.com/openshift/machine-config-operator/pkg/operator/operator.go:390: Failed to watch *v1alpha1.MachineOSConfig: failed to list *v1alpha1.MachineOSConfig: the server could not find the requested resource (get machineosconfigs.machineconfiguration.openshift.io)"

checking if that was caused by outdated CRDs.

The kube-rbac-proxy failures should also be old, since the pods should have the proper SCC labels now

@JoelSpeed
Copy link
Contributor

/retest

@yuqi-zhang
Copy link
Contributor Author

ci/prow/minor-e2e-upgrade-minor seems to be testing 4.17->4.18 upgrade, is that expected? I would have expected this to test 4.18->4.19

@sergiordlr
Copy link

  • Jobs running MCO test cases in a cluster with OCL enabled in master and worker pools

periodic-ci-openshift-openshift-tests-private-release-4.19-amd64-nightly-aws-ipi-longrun-mco-tp-ocl-p3-f7

periodic-ci-openshift-openshift-tests-private-release-4.19-amd64-nightly-aws-ipi-longrun-mco-tp-ocl-p2-f7

periodic-ci-openshift-openshift-tests-private-release-4.19-amd64-nightly-aws-ipi-longrun-mco-tp-ocl-p1-f7

These jobs are currently impacted by this issue: OCPBUGS-53408 In OCL. error: Old and new refs are equal

  • Job running openshift e2e tests in a cluster with OCL configured in master and worker pools

periodic-ci-openshift-openshift-tests-private-release-4.19-amd64-nightly-aws-ipi-tp-ocl-f7

An OCL cluster does not behave exactly the same as a non-OCL cluster, so some test cases may fail in an OCL cluster because of that. Especially the ones that scale up new nodes (OCL needs an extra reboot) or configure registries.conf (OCL clusters reboot the nodes but non-OCL cluster don't do it)

  • Job running an upgrade in a cluster with OCL configured in master and worker pools

periodic-ci-openshift-openshift-tests-private-release-4.19-amd64-nightly-4.19-upgrade-from-stable-4.19-aws-ipi-ocl-fips-f60

Same case as above, since an OCL cluster doesn't behave exactly as a non-OCL cluster some cases will never pass. Especially the test cases involving scaling up new nodes (OCL needs an extra reboot to apply the new image), and those configuring the registres.conf file (OCL clusters reboots the nodes in this scenario, but non-OCL clusters do not reboot the nodes).

  • Jobs running all MCO test cases in clusters with Techpreview. Specific OCL tests included

These jobs have been reconfigured to run daily

IPI on GCP, AMD64,TP

periodic-ci-openshift-openshift-tests-private-release-4.19-amd64-nightly-gcp-ipi-longduration-tp-mco-p1-f9

periodic-ci-openshift-openshift-tests-private-release-4.19-amd64-nightly-gcp-ipi-longduration-tp-mco-p2-f9

periodic-ci-openshift-openshift-tests-private-release-4.19-amd64-nightly-gcp-ipi-longduration-tp-mco-p3-f9

IPI on AWS, ARM64,PROXY,FIPS,TP

These jobs are currently impacted by OCPBUGS-49894 In OCL. Disabling OCL process is not working in clusters with proxy

Unfortunately this issue breaks the cluster and we cannot recover it, so once our tests hit this issue all tests will report failures or refuse to be executed.

periodic-ci-openshift-openshift-tests-private-release-4.19-arm64-nightly-aws-ipi-longrun-mco-tp-proxy-fips-p1-f28

periodic-ci-openshift-openshift-tests-private-release-4.19-arm64-nightly-aws-ipi-longrun-mco-tp-proxy-fips-p2-f28

periodic-ci-openshift-openshift-tests-private-release-4.19-arm64-nightly-aws-ipi-longrun-mco-tp-proxy-fips-p3-f28

I have just launched all the jobs with the latest nightly build.

@JoelSpeed
Copy link
Contributor

Thanks for the writeup, this is useful context!

ci/prow/minor-e2e-upgrade-minor seems to be testing 4.17->4.18 upgrade, is that expected? I would have expected this to test 4.18->4.19

No, that's an error, this hasn't been updated since branching, we really need to make a way to automate this, I will get on it.

These jobs are currently impacted by this issue: OCPBUGS-53408 In OCL. error: Old and new refs are equal

Based on feedback from some promotions in the previous release, we would ideally see a week of clean, daily runs of the CI prior to merging. We should aim to prio fixing this ASAP (and the other issue mentioned later)

An OCL cluster does not behave exactly the same as a non-OCL cluster, so some test cases may fail in an OCL cluster because of that. Especially the ones that scale up new nodes (OCL needs an extra reboot) or configure registries.conf (OCL clusters reboot the nodes but non-OCL cluster don't do it)

How do you ensure you understand this signal when there are so many false positives? Is work being done to mitigate the false positives and resolve the differences between the OCL and non-OCL cases?

Jobs running all MCO test cases in clusters with Techpreview. Specific OCL tests included
These jobs have been reconfigured to run daily

Once promoted, we will also need the stable cluster configuration to run daily, has this also been adjusted?

@JoelSpeed
Copy link
Contributor

/test minor-e2e-upgrade-minor

@JoelSpeed
Copy link
Contributor

E2E minor is now doing the correct thing, lets see if it works this time!

@JoelSpeed
Copy link
Contributor

/retest

@JoelSpeed
Copy link
Contributor

Adding some notes here from conversations so that we capture them within the promotion process:

@cheesesashimi has gather pass rates for the last week and put them together in a sheet.

Looking at this, and based on our conversation in slack, we can see that our top 5 tests are showing a minimum 85% pass rate, and, that we have on average 5 runs per platform right now.

For reference, we would normally be asking for at least 5 tests, and for all tests to attain a 95% pass rate across 6 platforms with 14 runs over 2 weeks to promote any regular feature.

It would be good to understand what is causing the tests to fail, are there issues with the feature itself, or can they be attributed to other failures?

@JoelSpeed
Copy link
Contributor

/skip
/retest-required

@yuqi-zhang yuqi-zhang force-pushed the promote-onclusterbuild-featuregate branch from c939ea4 to d386d6a Compare April 16, 2025 15:48
@yuqi-zhang
Copy link
Contributor Author

Rebased just for good measure

@yuqi-zhang
Copy link
Contributor Author

/retest

2 similar comments
@yuqi-zhang
Copy link
Contributor Author

/retest

@cheesesashimi
Copy link
Member

/retest

@sdodson
Copy link
Member

sdodson commented Apr 18, 2025

/test e2e-aws-serial-techpreview e2e-upgrade minor-e2e-upgrade-minor e2e-aws-serial e2e-aws-ovn e2e-aws-ovn-techpreview e2e-upgrade-out-of-change

I believe all of the CI issues have cleared up so lets see if we can get some of those to pass.

@cheesesashimi
Copy link
Member

The verify-crd-schema failure seems to be related to the fact that we renamed a file to machineconfiguration/v1alpha1/zz_generated.crd-manifests/0000_80_machine-config_01_machineosbuilds.crd.yaml. According to the error message output by the script, overriding this job failure is appropriate in this scenario:

This verifier checks all files that have changed. In some cases you may have changed or renamed a file that already contained api violations, but you are not introducing a new violation. In such cases it is appropriate to /override the failing CI job.

@JoelSpeed
Copy link
Contributor

/override ci/prow/verify-crd-schema

As explained above, this happened through a rename of an existing file. These are from existing problems that we cannot rectify without widespread consideration for the impacts

Copy link
Contributor

openshift-ci bot commented Apr 21, 2025

@JoelSpeed: Overrode contexts on behalf of JoelSpeed: ci/prow/verify-crd-schema

In response to this:

/override ci/prow/verify-crd-schema

As explained above, this happened through a rename of an existing file. These are from existing problems that we cannot rectify without widespread consideration for the impacts

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@cheesesashimi
Copy link
Member

/test e2e-aws-serial-techpreview e2e-upgrade minor-e2e-upgrade-minor e2e-aws-serial e2e-aws-ovn e2e-aws-ovn-techpreview e2e-upgrade-out-of-change

Try rerunning these since the failures did not appear related to this PR.

@cheesesashimi
Copy link
Member

/test e2e-aws-serial-techpreview e2e-upgrade minor-e2e-upgrade-minor e2e-aws-serial e2e-aws-ovn e2e-aws-ovn-techpreview e2e-upgrade-out-of-change

@JoelSpeed
Copy link
Contributor

The serial suite is pretty broken at the moment, since the suite passed on Friday, modulo a deprovision timeout, I think we can override this one

/override ci/prow/e2e-aws-serial-techpreview

Copy link
Contributor

openshift-ci bot commented Apr 22, 2025

@JoelSpeed: Overrode contexts on behalf of JoelSpeed: ci/prow/e2e-aws-serial-techpreview

In response to this:

The serial suite is pretty broken at the moment, since the suite passed on Friday, modulo a deprovision timeout, I think we can override this one

/override ci/prow/e2e-aws-serial-techpreview

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@JoelSpeed
Copy link
Contributor

/override ci/prow/e2e-aws-serial

This passed yesterday, and the base of main has only moved for a tooling change since, so I'm confident that this wouldn't have re-run if we hadn't merged that tooling change

Copy link
Contributor

openshift-ci bot commented Apr 22, 2025

@JoelSpeed: Overrode contexts on behalf of JoelSpeed: ci/prow/e2e-aws-serial

In response to this:

/override ci/prow/e2e-aws-serial

This passed yesterday, and the base of main has only moved for a tooling change since, so I'm confident that this wouldn't have re-run if we hadn't merged that tooling change

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@JoelSpeed
Copy link
Contributor

/override ci/prow/minor-e2e-upgrade-minor

This passed on Friday, but the tooling change meant it had to be re-run. I'm confident the tooling changes haven't impacted this result and so the value from Friday should still be valid

Copy link
Contributor

openshift-ci bot commented Apr 22, 2025

@JoelSpeed: Overrode contexts on behalf of JoelSpeed: ci/prow/minor-e2e-upgrade-minor

In response to this:

/override ci/prow/minor-e2e-upgrade-minor

This passed on Friday, but the tooling change meant it had to be re-run. I'm confident the tooling changes haven't impacted this result and so the value from Friday should still be valid

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@nstielau
Copy link

+1 to Promote OnClusterBuild featuregate to default, despite test results currently below the usual 95% expectation. cc @craychee

@JoelSpeed
Copy link
Contributor

Per #2192 (comment)

/lgtm
/override ci/prow/verify-feature-promotion

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 22, 2025
Copy link
Contributor

openshift-ci bot commented Apr 22, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoelSpeed, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 22, 2025
Copy link
Contributor

openshift-ci bot commented Apr 22, 2025

@JoelSpeed: Overrode contexts on behalf of JoelSpeed: ci/prow/verify-feature-promotion

In response to this:

Per #2192 (comment)

/lgtm
/override ci/prow/verify-feature-promotion

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Copy link
Contributor

openshift-ci bot commented Apr 22, 2025

@yuqi-zhang: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@sdodson
Copy link
Member

sdodson commented Apr 22, 2025

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 22, 2025
@sdodson sdodson merged commit 9aa03e6 into openshift:master Apr 22, 2025
23 of 24 checks passed
@openshift-bot
Copy link

[ART PR BUILD NOTIFIER]

Distgit: ose-cluster-config-api
This PR has been included in build ose-cluster-config-api-container-v4.20.0-202504221841.p0.g9aa03e6.assembly.stream.el9.
All builds following this will include this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants