
Conversation

@kattz-kawa
Contributor

@kattz-kawa kattz-kawa commented Nov 18, 2025

Why we need this PR

  • After the new placement rule introduced in Run manager with 2 replicas #180, the manager pods can only be scheduled on two specific nodes. Under these conditions, the default rollout strategy can deadlock during node recovery.
  • Certificate rotation managed by OLM can trigger Deployment rollouts by changing the olmcahash annotation. With the current strategy, such a rollout can deadlock if one of the nodes is down, so we need a more controlled update strategy to keep upgrades reliable and safe.

Changes made

  • Added a RollingUpdate strategy with maxSurge: 0 and maxUnavailable: 1 for safer deployment updates (see the sketch after this list).

  • Before (default strategy): when a rollout starts, the Deployment keeps the old pods running and tries to bring up a new pod before removing an old one. On a cluster with only 2 eligible nodes there is no room to schedule that extra pod, so the rollout gets stuck.

  • After: pods are updated one at a time. The controller deletes one old pod first, then creates its replacement, so the rollout can proceed safely even when only 2 nodes are available.
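
For reference, a minimal sketch of how the new strategy block sits in the manager Deployment spec; the surrounding fields and indentation are illustrative, not a verbatim copy of config/manager/manager.yaml or the CSV bundle:

```yaml
# Illustrative excerpt only: the replica count comes from the PR context
# (2 replicas), the strategy values are the ones added by this PR.
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0        # never schedule an extra pod beyond the desired count
      maxUnavailable: 1  # one old pod may be deleted first, freeing its node for the new pod
```

With maxSurge: 0 the controller never needs a third schedulable slot, and maxUnavailable: 1 lets it free a node before creating the replacement, which is what avoids the deadlock on a 2-node placement.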

Which issue(s) this PR fixes

RHWA-366

Summary by CodeRabbit

  • Chores
    • Updated the deployment strategy for the controller-manager components to RollingUpdate with explicit limits on pod surge and allowed unavailability, giving more controlled and stable updates while maintaining service availability throughout the upgrade process.

…vailable=1) to mitigate rollout deadlocks during automatic certificate rotation by OLM

Signed-off-by: Katsuya Kawakami <kkawakami-je@nec.com>
@openshift-ci
Contributor

openshift-ci bot commented Nov 18, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kattz-kawa
Once this PR has been reviewed and has the lgtm label, please assign razo7 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci
Contributor

openshift-ci bot commented Nov 18, 2025

Hi @kattz-kawa. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@coderabbitai

coderabbitai bot commented Nov 18, 2025

📝 Walkthrough

Walkthrough

Two YAML manifest files are updated to change the controller-manager deployment strategy from the default to an explicit RollingUpdate with maxSurge: 0 and maxUnavailable: 1, controlling pod surge and availability during rollouts.

Changes

Deployment Strategy Configuration
  • Files: bundle/manifests/self-node-remediation.clusterserviceversion.yaml, config/manager/manager.yaml
  • Summary: Adds a deployment strategy block with type RollingUpdate, maxSurge: 0, and maxUnavailable: 1 to control rollout behavior during updates.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

  • Simple, repetitive configuration changes applied consistently across two files
  • No behavioral logic, error handling, or conditional changes
  • Straightforward verification of strategy parameters

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check: ✅ Passed (check skipped because CodeRabbit's high-level summary is enabled)
  • Title Check: ✅ Passed (the title 'Change Deployment UpdateStrategy' clearly and concisely summarizes the main change of modifying the Deployment update strategy configuration)
  • Docstring Coverage: ✅ Passed (no functions found in the changed files, so docstring coverage was not evaluated)

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d80706e and a7f0d7a.

📒 Files selected for processing (2)
  • bundle/manifests/self-node-remediation.clusterserviceversion.yaml (1 hunks)
  • config/manager/manager.yaml (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (2)
config/manager/manager.yaml (1)

20-24: LGTM! Strategy configuration correctly addresses the 2-node cluster rollout deadlock.

The RollingUpdate strategy with maxSurge: 0 and maxUnavailable: 1 ensures that only one pod is replaced at a time, allowing new pods to be scheduled despite the topology constraints (lines 30–36) that restrict manager pods to specific nodes. This aligns with the operator's requirement to prevent rollout deadlocks during node recovery or certificate rotation.

Please confirm that both this file (config/manager/manager.yaml) and the OLM bundle manifest are kept in sync by a build/generation step, so future updates to one are automatically reflected in the other.

bundle/manifests/self-node-remediation.clusterserviceversion.yaml (1)

276-280: LGTM! Strategy configuration matches the intent defined in the source deployment manifest.

The nested rollingUpdate object and RollingUpdate type are correctly configured. The field ordering differs slightly from config/manager/manager.yaml (fields reversed), but this is inconsequential for YAML/Kubernetes semantics.

Verify that a build or generation tool automatically updates this bundle manifest when config/manager/manager.yaml changes, to maintain consistency and prevent manual sync errors in future updates.
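
For orientation, a rough sketch of where the same block sits inside the CSV's install spec; the deployment name and surrounding fields are assumed for illustration and are not copied from the file:

```yaml
# Hypothetical excerpt of the ClusterServiceVersion: only the path down to the
# strategy block is shown; the deployment name below is an assumption.
spec:
  install:
    strategy: deployment
    spec:
      deployments:
        - name: self-node-remediation-controller-manager   # assumed name
          spec:
            strategy:
              rollingUpdate:      # field order is reversed relative to manager.yaml,
                maxSurge: 0       # which is harmless for YAML/Kubernetes semantics
                maxUnavailable: 1
              type: RollingUpdate
```

If the bundle is regenerated from config/manager/manager.yaml by the project's tooling, the two copies should stay in sync automatically.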



Comment @coderabbitai help to get the list of available commands and usage tips.
