
🐛 increase the timeout when creating and upgrading CAPI controllers #27

Merged: 2 commits merged into d2iq/release-1.7.4 on Oct 1, 2024

Conversation

@dkoshkin (Collaborator) commented on Oct 1, 2024

What this PR does / why we need it:
This PR is an attempt to fix the race conditions we've been seeing when creating and upgrading CAPI controllers.
There are 2 fixes here:

  1. Extend CAPI's warm-up timeout so that the controller is not restarted prematurely while it waits for RuntimeExtensions that are still starting up.
  2. Extend the clusterctl timeout when creating and upgrading CAPI controllers (see the retry sketch below).
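For illustration only, here is a minimal sketch of the kind of retry loop these timeouts govern, using `wait.ExponentialBackoff` from k8s.io/apimachinery. The `checkProviderReady` helper is hypothetical, standing in for whatever readiness check clusterctl actually performs:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// checkProviderReady is a hypothetical readiness probe standing in for
// whatever check clusterctl performs while a provider comes up.
func checkProviderReady() (bool, error) {
	// ... query the provider Deployment and return (true, nil) once ready ...
	return false, nil
}

func main() {
	backoff := wait.Backoff{
		Duration: 500 * time.Millisecond, // first retry interval
		Factor:   1.5,                    // each interval grows by 1.5x
		Steps:    14,                     // give up after 14 attempts
	}
	// wait.ExponentialBackoff retries the condition until it returns true,
	// returns an error, or the steps are exhausted.
	if err := wait.ExponentialBackoff(backoff, checkProviderReady); err != nil {
		fmt.Println("provider did not become ready:", err)
	}
}
```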

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

When starting up, CAPI will try to "warm up" the configured RuntimeExtensions,
and will fail if any runtime extension Pods are not up.
This should fix the race conditions seen when creating and upgrading CAPI providers.
@dkoshkin changed the title from "fix: increase the timeout when creating and upgrading CAPI controllers" to "🐛 increase the timeout when creating and upgrading CAPI controllers" on Oct 1, 2024
@supershal left a comment

Thank you Dimitri. We went through the code.

  • We also need to increase the timeout when invoking the clusterctl ApplyUpgrade API call, so that it accounts for the registry warm-up timeout.

  • We will start a discussion upstream about improving the algorithm for the CAPI component upgrade process.

```diff
 // Jitter is added as a random fraction of the duration multiplied by the jitter factor.
 return wait.Backoff{
 	Duration: 500 * time.Millisecond,
 	Factor:   1.5,
-	Steps:    10,
+	Steps:    14,
 }
```
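For context, and ignoring jitter, the cumulative wait over n steps is Duration · (Factorⁿ − 1)/(Factor − 1): with the values above, Steps: 10 allows roughly 57 s in total, while Steps: 14 allows roughly 4.8 min. A quick sketch of that arithmetic:

```go
package main

import (
	"fmt"
	"time"
)

// totalBackoff sums duration * factor^i for i in [0, steps),
// i.e. the worst-case wall-clock wait ignoring jitter.
func totalBackoff(duration time.Duration, factor float64, steps int) time.Duration {
	total, step := time.Duration(0), float64(duration)
	for i := 0; i < steps; i++ {
		total += time.Duration(step)
		step *= factor
	}
	return total
}

func main() {
	fmt.Println(totalBackoff(500*time.Millisecond, 1.5, 10)) // ~56.7s
	fmt.Println(totalBackoff(500*time.Millisecond, 1.5, 14)) // ~4m51s
}
```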


I wonder if we should tweak Factor to make the maximum interval size smaller? Perhaps increasing Duration and Steps while reducing Factor could give a longer overall timeout but retain better UX with less excessive waiting.
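One way to express that trade-off, assuming the snippet uses k8s.io/apimachinery's `wait.Backoff`, would be its `Cap` field, which bounds how large any single interval can grow. A sketch only; the values here are illustrative, not the project's actual settings:

```go
// More, shorter retries over a similar overall window:
// Cap bounds each individual interval once the exponential growth exceeds it.
backoff := wait.Backoff{
	Duration: 500 * time.Millisecond,
	Factor:   1.5,
	Steps:    20,
	Cap:      10 * time.Second, // no single wait longer than 10s
}
```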

@dkoshkin (Collaborator, Author) replied:

I thought about that too, but avoided it for now to reduce the risk of introducing another race somewhere else.

@dkoshkin merged commit 4151009 into d2iq/release-1.7.4 on Oct 1, 2024
10 of 11 checks passed
dkoshkin added a commit that referenced this pull request Oct 9, 2024

* fix: extend the CAPI warmup timeout

When starting up, CAPI will try to "warm up" the configured RuntimeExtensions,
and will fail if any runtime extension Pods are not up.

* fix: extend the clusterctl timeout

This should fix the race conditions seen when creating and upgrading CAPI providers.