
Support re-assignment of another failure domain when the machine failed to provision #353

Open · wants to merge 5 commits into base: main

Conversation


@jhaanvi5 commented Apr 1, 2024

Issue #352

Description of changes: CAPC chooses a random failure domain in which to deploy worker machines. When the VM deployment fails, irrespective of the type of error, CAPC keeps re-attempting to deploy a VM until CAPI replaces the owner machine. This change introduces a failure domain balancer with a fallback mechanism: the primary balancer selects the failure domain whose network has the most free IPs, and if that fails it falls back to random failure domain selection.
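
A minimal sketch of the fallback idea, assuming a simplified Balancer interface (the PR's actual interface also receives the CloudStackMachine, the CAPI Machine, and the failure domain specs; all names here are illustrative):

package failuredomains

import (
	"context"
	"fmt"
)

// Balancer picks a failure domain for a machine. Simplified stand-in for
// the PR's Balancer interface.
type Balancer interface {
	Assign(ctx context.Context, fds []string) (string, error)
}

// fallingBackBalancer tries a primary Balancer (e.g. most-free-IPs) and, if
// it errors, delegates to a secondary one (e.g. random selection).
type fallingBackBalancer struct {
	primary, secondary Balancer
}

func (b fallingBackBalancer) Assign(ctx context.Context, fds []string) (string, error) {
	fd, err := b.primary.Assign(ctx, fds)
	if err == nil {
		return fd, nil
	}
	// Primary selection failed (e.g. the CloudStack API was unreachable);
	// fall back rather than failing the whole reconciliation.
	fd, err2 := b.secondary.Assign(ctx, fds)
	if err2 != nil {
		return "", fmt.Errorf("primary (%v) and fallback (%v) balancers failed", err, err2)
	}
	return fd, nil
}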

Testing performed:
make test

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 1, 2024

linux-foundation-easycla bot commented Apr 1, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Apr 1, 2024
@k8s-ci-robot
Contributor

Welcome @jhaanvi5!

It looks like this is your first PR to kubernetes-sigs/cluster-api-provider-cloudstack 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/cluster-api-provider-cloudstack has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Contributor

Hi @jhaanvi5. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Apr 1, 2024

netlify bot commented Apr 1, 2024

Deploy Preview for kubernetes-sigs-cluster-api-cloudstack ready!

🔨 Latest commit: 5cc6663
🔍 Latest deploy log: https://app.netlify.com/sites/kubernetes-sigs-cluster-api-cloudstack/deploys/664f92a1e117ce000890d2d3
😎 Deploy Preview: https://deploy-preview-353--kubernetes-sigs-cluster-api-cloudstack.netlify.app

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Apr 1, 2024
@jhaanvi5
Author

jhaanvi5 commented Apr 1, 2024

/cc @vignesh-goutham

@chrisdoherty4
Member

/uncc @chrisdoherty4

@k8s-ci-robot k8s-ci-robot removed the request for review from chrisdoherty4 April 2, 2024 14:38
Comment on lines 31 to 35
infrav1 "sigs.k8s.io/cluster-api-provider-cloudstack/api/v1beta3"
"sigs.k8s.io/cluster-api-provider-cloudstack/controllers/utils"
"sigs.k8s.io/cluster-api-provider-cloudstack/pkg/cloud"
cserrors "sigs.k8s.io/cluster-api-provider-cloudstack/pkg/errors"
"sigs.k8s.io/cluster-api-provider-cloudstack/pkg/failuredomains"


These should be in its own section below, right?
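
For reference, a conventional Go import layout with the module's own packages in their own final group (the package name and the exact standard-library and third-party imports shown are illustrative):

package controllers

import (
	// Standard library.
	"context"

	// Kubernetes and other third-party dependencies.
	corev1 "k8s.io/api/core/v1"

	// This module's packages, grouped in their own section.
	infrav1 "sigs.k8s.io/cluster-api-provider-cloudstack/api/v1beta3"
	"sigs.k8s.io/cluster-api-provider-cloudstack/controllers/utils"
	"sigs.k8s.io/cluster-api-provider-cloudstack/pkg/cloud"
	cserrors "sigs.k8s.io/cluster-api-provider-cloudstack/pkg/errors"
	"sigs.k8s.io/cluster-api-provider-cloudstack/pkg/failuredomains"
)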


clientConfig := &corev1.ConfigMap{}
key = client.ObjectKey{Name: cloud.ClientConfigMapName, Namespace: cloud.ClientConfigMapNamespace}
_ = f.Get(ctx, key, clientConfig)


Add a comment here mentioning that the client config is optional, hence we can ignore the error.
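
Something like the following, restating the quoted snippet with the suggested comment (the surrounding reconciler context is assumed):

// The client config ConfigMap is optional, so we deliberately ignore the Get
// error; reconciliation proceeds with default client settings if it is absent.
clientConfig := &corev1.ConfigMap{}
key = client.ObjectKey{Name: cloud.ClientConfigMapName, Namespace: cloud.ClientConfigMapNamespace}
_ = f.Get(ctx, key, clientConfig)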

@@ -26,6 +26,8 @@ import (
// The presence of a finalizer prevents CAPI from deleting the corresponding CAPI data.
const MachineFinalizer = "cloudstackmachine.infrastructure.cluster.x-k8s.io"

const MachineCreateFailAnnotation = "cluster.x-k8s.io/vm-create-failed"

Suggested change
const MachineCreateFailAnnotation = "cluster.x-k8s.io/vm-create-failed"
const MachineCreateFailedAnnotation = "cluster.x-k8s.io/vm-create-failed"

@jhaanvi5 jhaanvi5 marked this pull request as ready for review April 8, 2024 16:21
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 8, 2024
@rohityadavcloud
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 11, 2024
@chrisdoherty4
Member

/uncc chrisdoherty4

@k8s-ci-robot k8s-ci-robot removed the request for review from chrisdoherty4 April 30, 2024 13:45
@rohityadavcloud rohityadavcloud added this to the v0.5.0 milestone May 3, 2024
@g-gaston g-gaston self-requested a review May 8, 2024 13:31
r.Log.Info("Marking machine as failed to launch", "csMachine", r.ReconciliationSubject.GetName())
r.ReconciliationSubject.MarkAsFailed()

if _, err := r.ReconcileDelete(); err != nil {

I don't follow this part. Why ReconcileDelete when the machine is marked as failed?

@@ -78,6 +83,26 @@ func (c *client) ResolveNetwork(net *infrav1.Network) (retErr error) {
return nil
}

// GetPublicIPs gets public IP addresses for the associated failure domain network

Why "failure domain network"? This code doesn't seem to do anything specific to failure domains, right? Wouldn't this work for any network?

csErrorCodeRegexp, _ = regexp.Compile(".+CSExceptionErrorCode: ([0-9]+).+")

// List of error codes: https://docs.cloudstack.apache.org/en/latest/developersguide/dev.html#error-handling
csTerminalErrorCodes = strings.Split(getEnv("CLOUDSTACK_TERMINAL_FAILURE_CODES", "4250,9999"), ",")

What's the reason to make this configurable?
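
For context, a self-contained sketch of how the extraction and terminal-code check in the quoted snippet might fit together (getEnv and isTerminalError are stand-ins for the PR's helpers, and MustCompile replaces the ignored-error Compile for brevity):

package errors

import (
	"os"
	"regexp"
	"strings"
)

var (
	csErrorCodeRegexp = regexp.MustCompile(`.+CSExceptionErrorCode: ([0-9]+).+`)

	// List of error codes: https://docs.cloudstack.apache.org/en/latest/developersguide/dev.html#error-handling
	csTerminalErrorCodes = strings.Split(getEnv("CLOUDSTACK_TERMINAL_FAILURE_CODES", "4250,9999"), ",")
)

// getEnv returns the environment variable's value, or a default when unset.
func getEnv(key, fallback string) string {
	if v, ok := os.LookupEnv(key); ok {
		return v
	}
	return fallback
}

// isTerminalError reports whether err's message carries a CloudStack
// exception code that should be treated as non-retryable.
func isTerminalError(err error) bool {
	m := csErrorCodeRegexp.FindStringSubmatch(err.Error())
	if len(m) != 2 {
		return false
	}
	for _, code := range csTerminalErrorCodes {
		if m[1] == code {
			return true
		}
	}
	return false
}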

@@ -0,0 +1,62 @@
/*
Copyright 2022 The Kubernetes Authors.

Suggested change
Copyright 2022 The Kubernetes Authors.
Copyright 2024 The Kubernetes Authors.

🤷🏻‍♂️

)

type Balancer interface {
Assign(ctx context.Context, csMachine *infrav1.CloudStackMachine, capiMachine *clusterv1.Machine, fds []infrav1.CloudStackFailureDomainSpec) error

What does it entail to assign a failure domain? Only setting the field on the machines?

In that case I would vote to rewrite this to return the actual failure domain, and let the caller set the field on the machine. This is just a personal opinion, but I think that minimizing side effects makes things easier to understand. In this case, it separates the logic that decides on a failure domain from the code that sets it.
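
A sketch of the suggested side-effect-free shape (method and function names are illustrative; the string return matches the Spec.FailureDomainName field used elsewhere in the diff):

package failuredomains

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	infrav1 "sigs.k8s.io/cluster-api-provider-cloudstack/api/v1beta3"
)

// Balancer decides which failure domain a machine should land in. It returns
// the choice instead of mutating the machine, keeping the decision logic free
// of side effects.
type Balancer interface {
	Select(ctx context.Context, csMachine *infrav1.CloudStackMachine,
		capiMachine *clusterv1.Machine, fds []infrav1.CloudStackFailureDomainSpec) (string, error)
}

// assignFailureDomain shows the caller applying the decision to the machine.
func assignFailureDomain(ctx context.Context, b Balancer, csMachine *infrav1.CloudStackMachine,
	capiMachine *clusterv1.Machine, fds []infrav1.CloudStackFailureDomainSpec) error {
	name, err := b.Select(ctx, csMachine, capiMachine, fds)
	if err != nil {
		return err
	}
	csMachine.Spec.FailureDomainName = name
	return nil
}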

Comment on lines +39 to +42
return newReassigningFailureDomainBalancer(newFallingBackFailureDomainBalancer(
newFreeIPValidatingFailureDomainBalancer(csClientFactory),
newRandomFailureDomainBalancer(),
))

I appreciate the idea of using composition here. However, what's the benefit in this case?

It doesn't seem like we are reusing any of the Balancers. Do we expect to have to use them in a different way at some point? Can we actually use them separately? In other words, are they truly independent?

If it's just to decompose the problem into smaller parts, does that make the program simpler or more complex?

Comment on lines +65 to +74
if capiMachineHasFailureDomain(capiMachine) && !csMachine.HasFailed() {
csMachine.Spec.FailureDomainName = *capiMachine.Spec.FailureDomain
assignFailureDomainLabel(csMachine, capiMachine)
} else if err := r.delegate.Assign(ctx, csMachine, capiMachine, fds); err != nil {
return err
}

if capiMachineHasFailureDomain(capiMachine) && csMachine.HasFailed() {
capiMachine.Spec.FailureDomain = pointer.String(csMachine.Spec.FailureDomainName)
}

I'll admit I'm not an expert on this area of the code base, but the intent here is not clear to me. If this is the best way to write this logic, I think it could benefit from some comments explaining what it's supposed to be doing (at a higher level than the code).
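
For instance, a commented restatement of the quoted branch logic (the reading of intent here is an assumption based on the helper names):

// If CAPI already picked a failure domain and this machine hasn't failed in
// it, honor CAPI's choice and mirror it onto the CloudStackMachine.
if capiMachineHasFailureDomain(capiMachine) && !csMachine.HasFailed() {
	csMachine.Spec.FailureDomainName = *capiMachine.Spec.FailureDomain
	assignFailureDomainLabel(csMachine, capiMachine)
} else if err := r.delegate.Assign(ctx, csMachine, capiMachine, fds); err != nil {
	// Otherwise (no CAPI choice, or the machine failed there), let the
	// delegate balancer pick a new failure domain.
	return err
}

// If the machine had failed in CAPI's chosen domain, propagate the newly
// assigned domain back onto the CAPI machine so both objects agree.
if capiMachineHasFailureDomain(capiMachine) && csMachine.HasFailed() {
	capiMachine.Spec.FailureDomain = pointer.String(csMachine.Spec.FailureDomainName)
}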

))
}

type reassigningFailureDomainBalancer struct {

I think all these implementations of Balancer could use some Go docs explaining what they do.
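
E.g., something along these lines (the wording is a guess at the intended semantics):

// reassigningFailureDomainBalancer wraps another Balancer and re-assigns a
// machine to a different failure domain when its previous VM creation failed,
// instead of retrying the same domain indefinitely.
type reassigningFailureDomainBalancer struct {
	delegate Balancer
}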

@vishesh92
Member

@jhaanvi5 What is the status/ETA for this PR?
I am going through the open issues and tasks for the next release of CAPC (v0.5.0). It would be good to have an estimate to decide whether to include this in v0.5.0 or not.

cc: @g-gaston @vignesh-goutham @vivek-koppuru

Improve destroy VM reconciliation to reduce duplicated async jobs
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jhaanvi5, vivek-koppuru
Once this PR has been reviewed and has the lgtm label, please ask for approval from vishesh92. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@vivek-koppuru

@vishesh92 Figuring this out, I think this is done for the most part, so it would be great to get your review on this as well. This came as an ask, so we need to follow up if any of these changes can't make it into the v0.5.0 release, but we would love to have it included pending review from you all!

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 28, 2024
@k8s-ci-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@vishesh92
Member

@vishesh92 Figuring this out, I think this is done for the most part, so it would be great to get your review on this as well. This came as an ask, so we need to follow up if any of these changes can't make it into the v0.5.0 release, but we would love to have it included pending review from you all!

@vivek-koppuru I reviewed this PR a little and have the same questions as @g-gaston. Could you please resolve/address the comments? This PR also has some merge conflicts now.

@vishesh92 vishesh92 modified the milestones: v0.5.0, v0.6 May 31, 2024
@vishesh92 vishesh92 linked an issue May 31, 2024 that may be closed by this pull request
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 29, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 28, 2024
Labels
  • cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
  • lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.
  • ok-to-test Indicates a non-member PR verified by an org member that is safe to test.
  • release:must-have
  • size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Failure domain re-assignment on Cloudstack machine deploy failures
9 participants