
Support re-assignment of another failure domain when the machine failed to provision #353

Open · wants to merge 5 commits into base: main

Conversation


@jhaanvi5 commented Apr 1, 2024

Issue #352

Description of changes: CAPC chooses a random failure domain in which to deploy worker machines. When the VM deployment fails, irrespective of the type of error, CAPC keeps re-attempting to deploy a VM until CAPI replaces the owner machine. This change introduces a failure domain balancer with a fallback mechanism: the primary balancer selects the failure domain whose network has the most free IPs, and if that fails it falls back to random failure domain selection.
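
A minimal sketch of the fallback idea, assuming a simplified Balancer interface (the PR's actual interface also receives the CloudStackMachine, the CAPI Machine, and the failure domain specs; all names here are illustrative):

package failuredomains

import (
	"context"
	"fmt"
)

// Balancer picks a failure domain for a machine. Simplified stand-in for
// the PR's Balancer interface.
type Balancer interface {
	Assign(ctx context.Context, fds []string) (string, error)
}

// fallingBackBalancer tries a primary Balancer (e.g. most-free-IPs) and, if
// it errors, delegates to a secondary one (e.g. random selection).
type fallingBackBalancer struct {
	primary, secondary Balancer
}

func (b fallingBackBalancer) Assign(ctx context.Context, fds []string) (string, error) {
	fd, err := b.primary.Assign(ctx, fds)
	if err == nil {
		return fd, nil
	}
	// Primary selection failed (e.g. the CloudStack API was unreachable);
	// fall back rather than failing the whole reconciliation.
	fd, err2 := b.secondary.Assign(ctx, fds)
	if err2 != nil {
		return "", fmt.Errorf("primary (%v) and fallback (%v) balancers failed", err, err2)
	}
	return fd, nil
}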

Testing performed:
make test

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 1, 2024

linux-foundation-easycla bot commented Apr 1, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Apr 1, 2024
@k8s-ci-robot
Contributor

Welcome @jhaanvi5!

It looks like this is your first PR to kubernetes-sigs/cluster-api-provider-cloudstack 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/cluster-api-provider-cloudstack has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Contributor

Hi @jhaanvi5. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Apr 1, 2024

netlify bot commented Apr 1, 2024

Deploy Preview for kubernetes-sigs-cluster-api-cloudstack ready!

🔨 Latest commit: 5cc6663
🔍 Latest deploy log: https://app.netlify.com/sites/kubernetes-sigs-cluster-api-cloudstack/deploys/664f92a1e117ce000890d2d3
😎 Deploy Preview: https://deploy-preview-353--kubernetes-sigs-cluster-api-cloudstack.netlify.app

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Apr 1, 2024
@jhaanvi5
Author

jhaanvi5 commented Apr 1, 2024

/cc @vignesh-goutham

@chrisdoherty4
Member

/uncc @chrisdoherty4

@k8s-ci-robot k8s-ci-robot removed the request for review from chrisdoherty4 April 2, 2024 14:38
Comment on lines 31 to 35
infrav1 "sigs.k8s.io/cluster-api-provider-cloudstack/api/v1beta3"
"sigs.k8s.io/cluster-api-provider-cloudstack/controllers/utils"
"sigs.k8s.io/cluster-api-provider-cloudstack/pkg/cloud"
cserrors "sigs.k8s.io/cluster-api-provider-cloudstack/pkg/errors"
"sigs.k8s.io/cluster-api-provider-cloudstack/pkg/failuredomains"


These should be in its own section below, right?
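
For reference, a conventional Go import layout with the module's own packages in their own final group (the package name and the exact standard-library and third-party imports shown are illustrative):

package controllers

import (
	// Standard library.
	"context"

	// Kubernetes and other third-party dependencies.
	corev1 "k8s.io/api/core/v1"

	// This module's packages, grouped in their own section.
	infrav1 "sigs.k8s.io/cluster-api-provider-cloudstack/api/v1beta3"
	"sigs.k8s.io/cluster-api-provider-cloudstack/controllers/utils"
	"sigs.k8s.io/cluster-api-provider-cloudstack/pkg/cloud"
	cserrors "sigs.k8s.io/cluster-api-provider-cloudstack/pkg/errors"
	"sigs.k8s.io/cluster-api-provider-cloudstack/pkg/failuredomains"
)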


clientConfig := &corev1.ConfigMap{}
key = client.ObjectKey{Name: cloud.ClientConfigMapName, Namespace: cloud.ClientConfigMapNamespace}
_ = f.Get(ctx, key, clientConfig)


Add a comment here mentioning that the client config is optional, hence we can ignore the error.
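
Something like the following, restating the quoted snippet with the suggested comment (the surrounding reconciler context is assumed):

// The client config ConfigMap is optional, so we deliberately ignore the Get
// error; reconciliation proceeds with default client settings if it is absent.
clientConfig := &corev1.ConfigMap{}
key = client.ObjectKey{Name: cloud.ClientConfigMapName, Namespace: cloud.ClientConfigMapNamespace}
_ = f.Get(ctx, key, clientConfig)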

@@ -26,6 +26,8 @@ import (
// The presence of a finalizer prevents CAPI from deleting the corresponding CAPI data.
const MachineFinalizer = "cloudstackmachine.infrastructure.cluster.x-k8s.io"

const MachineCreateFailAnnotation = "cluster.x-k8s.io/vm-create-failed"

Suggested change
const MachineCreateFailAnnotation = "cluster.x-k8s.io/vm-create-failed"
const MachineCreateFailedAnnotation = "cluster.x-k8s.io/vm-create-failed"

@jhaanvi5 jhaanvi5 marked this pull request as ready for review April 8, 2024 16:21
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 8, 2024
@rohityadavcloud
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 11, 2024
@chrisdoherty4
Member

/uncc chrisdoherty4

@k8s-ci-robot k8s-ci-robot removed the request for review from chrisdoherty4 April 30, 2024 13:45
@rohityadavcloud rohityadavcloud added this to the v0.5.0 milestone May 3, 2024
@g-gaston g-gaston self-requested a review May 8, 2024 13:31
r.Log.Info("Marking machine as failed to launch", "csMachine", r.ReconciliationSubject.GetName())
r.ReconciliationSubject.MarkAsFailed()

if _, err := r.ReconcileDelete(); err != nil {

I don't follow this part. Why ReconcileDelete when the machine is marked as failed?

@@ -78,6 +83,26 @@ func (c *client) ResolveNetwork(net *infrav1.Network) (retErr error) {
return nil
}

// GetPublicIPs gets public IP addresses for the associated failure domain network

Why "failure domain network"? This code doesn't seem to do anything specific to failure domains, right? Wouldn't this work for any network?

csErrorCodeRegexp, _ = regexp.Compile(".+CSExceptionErrorCode: ([0-9]+).+")

// List of error codes: https://docs.cloudstack.apache.org/en/latest/developersguide/dev.html#error-handling
csTerminalErrorCodes = strings.Split(getEnv("CLOUDSTACK_TERMINAL_FAILURE_CODES", "4250,9999"), ",")

What's the reason to make this configurable?
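
For context, a self-contained sketch of how the extraction and terminal-code check in the quoted snippet might fit together (getEnv and isTerminalError are stand-ins for the PR's helpers, and MustCompile replaces the ignored-error Compile for brevity):

package errors

import (
	"os"
	"regexp"
	"strings"
)

var (
	csErrorCodeRegexp = regexp.MustCompile(`.+CSExceptionErrorCode: ([0-9]+).+`)

	// List of error codes: https://docs.cloudstack.apache.org/en/latest/developersguide/dev.html#error-handling
	csTerminalErrorCodes = strings.Split(getEnv("CLOUDSTACK_TERMINAL_FAILURE_CODES", "4250,9999"), ",")
)

// getEnv returns the environment variable's value, or a default when unset.
func getEnv(key, fallback string) string {
	if v, ok := os.LookupEnv(key); ok {
		return v
	}
	return fallback
}

// isTerminalError reports whether err's message carries a CloudStack
// exception code that should be treated as non-retryable.
func isTerminalError(err error) bool {
	m := csErrorCodeRegexp.FindStringSubmatch(err.Error())
	if len(m) != 2 {
		return false
	}
	for _, code := range csTerminalErrorCodes {
		if m[1] == code {
			return true
		}
	}
	return false
}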

@@ -0,0 +1,62 @@
/*
Copyright 2022 The Kubernetes Authors.

Suggested change
Copyright 2022 The Kubernetes Authors.
Copyright 2024 The Kubernetes Authors.

🤷🏻‍♂️

)

type Balancer interface {
Assign(ctx context.Context, csMachine *infrav1.CloudStackMachine, capiMachine *clusterv1.Machine, fds []infrav1.CloudStackFailureDomainSpec) error

What does it entail to assign a failure domain? Only setting the field on the machines?

In that case I would vote to rewrite this to return the actual failure domain, and let the caller set the field on the machine. This is just a personal opinion, but I think that minimizing side effects makes things easier to understand. In this case, it separates the logic that decides on a failure domain from the code that sets it.
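
A sketch of the suggested side-effect-free shape (method and function names are illustrative; the string return matches the Spec.FailureDomainName field used elsewhere in the diff):

package failuredomains

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	infrav1 "sigs.k8s.io/cluster-api-provider-cloudstack/api/v1beta3"
)

// Balancer decides which failure domain a machine should land in. It returns
// the choice instead of mutating the machine, keeping the decision logic free
// of side effects.
type Balancer interface {
	Select(ctx context.Context, csMachine *infrav1.CloudStackMachine,
		capiMachine *clusterv1.Machine, fds []infrav1.CloudStackFailureDomainSpec) (string, error)
}

// assignFailureDomain shows the caller applying the decision to the machine.
func assignFailureDomain(ctx context.Context, b Balancer, csMachine *infrav1.CloudStackMachine,
	capiMachine *clusterv1.Machine, fds []infrav1.CloudStackFailureDomainSpec) error {
	name, err := b.Select(ctx, csMachine, capiMachine, fds)
	if err != nil {
		return err
	}
	csMachine.Spec.FailureDomainName = name
	return nil
}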

Comment on lines +39 to +42
return newReassigningFailureDomainBalancer(newFallingBackFailureDomainBalancer(
newFreeIPValidatingFailureDomainBalancer(csClientFactory),
newRandomFailureDomainBalancer(),
))

I appreciate the idea of using composition here. However, what's the benefit in this case?

It doesn't seem like we are reusing any of the Balancers. Do we expect to have to use them in a different way at some point? Can we actually use them separately? In other words, are they truly independent?

If it's just to decompose the problem into smaller parts, does that make the program simpler or more complex?

Comment on lines +65 to +74
if capiMachineHasFailureDomain(capiMachine) && !csMachine.HasFailed() {
csMachine.Spec.FailureDomainName = *capiMachine.Spec.FailureDomain
assignFailureDomainLabel(csMachine, capiMachine)
} else if err := r.delegate.Assign(ctx, csMachine, capiMachine, fds); err != nil {
return err
}

if capiMachineHasFailureDomain(capiMachine) && csMachine.HasFailed() {
capiMachine.Spec.FailureDomain = pointer.String(csMachine.Spec.FailureDomainName)
}

I'll admit I'm not an expert on this area of the code base, but the intent here is not clear to me. If this is the best way to write this logic, I think it could benefit from some comments explaining what it's supposed to be doing (at a higher level than the code).
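
For instance, a commented restatement of the quoted branch logic (the reading of intent here is an assumption based on the helper names):

// If CAPI already picked a failure domain and this machine hasn't failed in
// it, honor CAPI's choice and mirror it onto the CloudStackMachine.
if capiMachineHasFailureDomain(capiMachine) && !csMachine.HasFailed() {
	csMachine.Spec.FailureDomainName = *capiMachine.Spec.FailureDomain
	assignFailureDomainLabel(csMachine, capiMachine)
} else if err := r.delegate.Assign(ctx, csMachine, capiMachine, fds); err != nil {
	// Otherwise (no CAPI choice, or the machine failed there), let the
	// delegate balancer pick a new failure domain.
	return err
}

// If the machine had failed in CAPI's chosen domain, propagate the newly
// assigned domain back onto the CAPI machine so both objects agree.
if capiMachineHasFailureDomain(capiMachine) && csMachine.HasFailed() {
	capiMachine.Spec.FailureDomain = pointer.String(csMachine.Spec.FailureDomainName)
}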

))
}

type reassigningFailureDomainBalancer struct {

I think all these implementations of Balancer could use some Go docs explaining what they do.
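
E.g., something along these lines (the wording is a guess at the intended semantics):

// reassigningFailureDomainBalancer wraps another Balancer and re-assigns a
// machine to a different failure domain when its previous VM creation failed,
// instead of retrying the same domain indefinitely.
type reassigningFailureDomainBalancer struct {
	delegate Balancer
}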

@vishesh92
Member

@jhaanvi5 What is the status/ETA for this PR?
I am going through the open issues and tasks for the next release of CAPC (v0.5.0). It would be good to have an estimate to decide whether to include this in v0.5.0 or not.

cc: @g-gaston @vignesh-goutham @vivek-koppuru

Improve destroy VM reconciliation to reduce duplicated async jobs
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jhaanvi5, vivek-koppuru
Once this PR has been reviewed and has the lgtm label, please ask for approval from vishesh92. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@vivek-koppuru

@vishesh92 Figuring this out, I think this is done for the most part, so it would be great to get your review on this as well. This came as an ask, so we need to follow up if any of these changes can't make it into the v0.5.0 release, but we would love to have it included pending review from you all!

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 28, 2024
@k8s-ci-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@vishesh92
Member

@vishesh92 Figuring this out, I think this is done for the most part, so it would be great to get your review on this as well. This came as an ask, so we need to follow up if any of these changes can't make it into the v0.5.0 release, but we would love to have it included pending review from you all!

@vivek-koppuru I reviewed this PR a little and have the same questions as @g-gaston. Could you please resolve/address the comments? This PR also has some merge conflicts now.

@vishesh92 vishesh92 modified the milestones: v0.5.0, v0.6 May 31, 2024
@vishesh92 vishesh92 linked an issue May 31, 2024 that may be closed by this pull request
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 29, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 28, 2024
Labels
  • cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
  • lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.
  • ok-to-test Indicates a non-member PR verified by an org member that is safe to test.
  • release:must-have
  • size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Failure domain re-assignment on Cloudstack machine deploy failures
9 participants