
🐛 increase the timeout when creating and upgrading CAPI controllers #27

Merged: 2 commits merged into d2iq/release-1.7.4 on Oct 1, 2024

Conversation

@dkoshkin (Collaborator) commented on Oct 1, 2024

What this PR does / why we need it:
This PR is an attempt to fix the race conditions we've been seeing when creating and upgrading CAPI controllers.
There are 2 fixes here:

  1. Extend CAPI's warm-up timeout so that the controller is not restarted prematurely while it waits for RuntimeExtensions that are still starting up.
  2. Extend the clusterctl timeout when creating and upgrading CAPI controllers (see the retry sketch below).
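For illustration only, here is a minimal sketch of the kind of retry loop these timeouts govern, using `wait.ExponentialBackoff` from k8s.io/apimachinery. The `checkProviderReady` helper is hypothetical, standing in for whatever readiness check clusterctl actually performs:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// checkProviderReady is a hypothetical readiness probe standing in for
// whatever check clusterctl performs while a provider comes up.
func checkProviderReady() (bool, error) {
	// ... query the provider Deployment and return (true, nil) once ready ...
	return false, nil
}

func main() {
	backoff := wait.Backoff{
		Duration: 500 * time.Millisecond, // first retry interval
		Factor:   1.5,                    // each interval grows by 1.5x
		Steps:    14,                     // give up after 14 attempts
	}
	// wait.ExponentialBackoff retries the condition until it returns true,
	// returns an error, or the steps are exhausted.
	if err := wait.ExponentialBackoff(backoff, checkProviderReady); err != nil {
		fmt.Println("provider did not become ready:", err)
	}
}
```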

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

When starting up, CAPI will try to "warm up" the configured RuntimeExtensions,
and will fail if any runtime extension Pods are not up.
This should fix the race conditions seen when creating and upgrading CAPI providers.
@dkoshkin changed the title from "fix: increase the timeout when creating and upgrading CAPI controllers" to "🐛 increase the timeout when creating and upgrading CAPI controllers" on Oct 1, 2024
@supershal left a comment

Thank you Dimitri. We went through the code.

  • We also need to increase the timeout when invoking the clusterctl ApplyUpgrade API call, so that it accounts for the registry warm-up timeout.

  • We will start a discussion upstream about improving the algorithm for the CAPI component upgrade process.

```diff
 // Jitter is added as a random fraction of the duration multiplied by the jitter factor.
 return wait.Backoff{
 	Duration: 500 * time.Millisecond,
 	Factor:   1.5,
-	Steps:    10,
+	Steps:    14,
 }
```
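For context, and ignoring jitter, the cumulative wait over n steps is Duration · (Factorⁿ − 1)/(Factor − 1): with the values above, Steps: 10 allows roughly 57 s in total, while Steps: 14 allows roughly 4.8 min. A quick sketch of that arithmetic:

```go
package main

import (
	"fmt"
	"time"
)

// totalBackoff sums duration * factor^i for i in [0, steps),
// i.e. the worst-case wall-clock wait ignoring jitter.
func totalBackoff(duration time.Duration, factor float64, steps int) time.Duration {
	total, step := time.Duration(0), float64(duration)
	for i := 0; i < steps; i++ {
		total += time.Duration(step)
		step *= factor
	}
	return total
}

func main() {
	fmt.Println(totalBackoff(500*time.Millisecond, 1.5, 10)) // ~56.7s
	fmt.Println(totalBackoff(500*time.Millisecond, 1.5, 14)) // ~4m51s
}
```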


I wonder if we should tweak Factor to make the maximum interval size smaller? Perhaps increasing Duration and Steps while reducing Factor could give a longer overall timeout but retain better UX with less excessive waiting.
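One way to express that trade-off, assuming the snippet uses k8s.io/apimachinery's `wait.Backoff`, would be its `Cap` field, which bounds how large any single interval can grow. A sketch only; the values here are illustrative, not the project's actual settings:

```go
// More, shorter retries over a similar overall window:
// Cap bounds each individual interval once the exponential growth exceeds it.
backoff := wait.Backoff{
	Duration: 500 * time.Millisecond,
	Factor:   1.5,
	Steps:    20,
	Cap:      10 * time.Second, // no single wait longer than 10s
}
```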

@dkoshkin (Collaborator, Author) replied:

I thought about that too, but avoided it for now to reduce the risk of introducing another race somewhere else.

@dkoshkin merged commit 4151009 into d2iq/release-1.7.4 on Oct 1, 2024
10 of 11 checks passed
dkoshkin added a commit that referenced this pull request Oct 9, 2024

* fix: extend the CAPI warmup timeout

When starting up, CAPI will try to "warm up" the configured RuntimeExtensions,
and will fail if any runtime extension Pods are not up.

* fix: extend the clusterctl timeout

This should fix the race conditions seen when creating and upgrading CAPI providers.