-
Notifications
You must be signed in to change notification settings - Fork 580
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
✨ Add AWSMachines to back the ec2 instances in AWSMachinePools #4527
✨ Add AWSMachines to back the ec2 instances in AWSMachinePools #4527
Conversation
Hi @cnmcavoy. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Edit: tests have been added. |
/ok-to-test |
6578b66
to
16514f6
Compare
/assign |
machine, err := util.GetOwnerMachine(ctx, client, awsMachine.ObjectMeta) | ||
if err != nil { | ||
return fmt.Errorf("failed to get owner machine for %s/%s: %w", awsMachine.Namespace, awsMachine.Name, err) | ||
} | ||
log.V(2).Info("Deleting orphaned machine", "machine", machine, "awsmachine", awsMachine, "ProviderID", providerID) | ||
if machine == nil { | ||
// XXX(cmcavoy): if we got here, something went wrong with the owner reference | ||
if err := client.Delete(ctx, &awsMachine); err != nil { | ||
return fmt.Errorf("failed to delete orphan awsMachine %s/%s: %w", awsMachine.Namespace, awsMachine.Name, err) | ||
} | ||
continue | ||
} | ||
|
||
if err := client.Delete(ctx, machine); err != nil { | ||
return fmt.Errorf("failed to delete orphan machine %s/%s: %w", machine.Namespace, machine.Name, err) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if I understand this correctly...
If there is a machine for which there is no provider... check if there is an owner. If there is an owner, we delete the owner? And then we hope that cascading delete will also delete this machine?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is how I parsed the proposal language:
When a MachinePool Machine is deleted manually, the system will delete the corresponding provider-specific resource. The opposite is also true: when a provider-specific resource is deleted, the system will delete the corresponding MachinePool Machine. This happens by virtue of the infrastructureRef <-> ownerRef relationship.
@Jont828 @mboersma @devigned wrote the proposal and perhaps can provide clarity ...?
What is the expected outcome when an AWSMachine and Machine pair, which is owned by a MachinePool, when it's ec2 instance is destroyed via an external action (e.g user terminates the instance in the aws console). Should the infrastructure provider (CAPA) detect this termination and delete the CAPI Machine resource?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, if an AWSMachine's associated provider instance is deleted, we expect the AWSMachinePool to delete the Machine and AWSMachine pair. This is so that we are not left with hanging resources in the event of an out-of-band instance deletion like you described.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but... if we delete its owner, is it okay that any OTHER machine this thing owns will also get deleted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not quite sure I understand. Are you saying that if you delete the CAPI Machine, i.e. owner of the AWS Machine, you want to also delete another CAPI Machine associated with the same instance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I mean is it a problem if the deleted owner owns more machines? Those would also be deleted then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you delete the CAPI Machine, we expect the CAPI Machine controller to delete the AWSMachine. It would be the same behavior we already have with MachineDeployments except there is no KubeadmConfig object to delete. Does that answer your question?
16514f6
to
f51d2ab
Compare
7cc41eb
to
cbfd8e2
Compare
@cnmcavoy is there a final review pending for this PR, or any other items? |
@Ankitasw I think all the questions have been answered, it probably needs a new reviewer to give it a look over for a lgtm. It's ready to be merged if there are no more review items to address. |
I can have a look into this, @cnmcavoy – do you want to resolve conflicts first? |
cbfd8e2
to
4c6e311
Compare
6632e0f
to
cc16742
Compare
Sure, I think I got them all sorted. |
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reviewed most of the PR except the test and managed machine pool code. Sorry, I have quite a few questions and comments 😉.
Overall, this looks pretty good. Did you successfully test this already?
err = errors.New("no instance found for machine pool") | ||
machineScope.Error(err, "unable to find instance") | ||
conditions.MarkUnknown(machineScope.AWSMachine, infrav1.InstanceReadyCondition, infrav1.InstanceNotFoundReason, err.Error()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you tested the new feature, did this code block spam the logs with lots of errors? I can imagine that a missing EC2 instance can happen a lot if the ASG terminates instances. Or does this not happen because CAPA watches for deletion events?
Wondering because findInstance
has no special logic for a "not found" error which would allow us to swallow those.
And shouldn't we delete the Machine
(and thereby InfraMachine
) in this case (EC2 instance went away, so it shouldn't be represented as object anymore)? Otherwise the object may dangle indefinitely.
(TODO / note to self: the AWSMachinePool reconciler may do that, check in my code review)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not recall being spammed with errors, but I last tested this cod in a cluster 6 months ago.
}, | ||
// Note: this AWSMachine will be owned by the MachinePool until the MachinePool controller | ||
// creates its parent Machine which will adopt this resource and replace the owner reference. | ||
// We set the MachinePool as a temporary owner to prevent this from becoming an orphan resource. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to set BlockOwnerDeletion: true
in the owner reference?
In CAPZ, that is done, and they're using AzureMachinePool
, not MachinePool
as the owner. I'm wondering which is better. Having CAPI's MachinePool
controller own an AWS*
resource seems wrong at first sight, since it won't delete the referenced object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re: BlockOwnerDeletion
I am not sure. I didn't set it bc this is supposed to be temporary ownership.
I'm not familiar with the CAPZ implementation. Do they have AzureMachinePool
act as a long-lived owner, or does the ownership get passed to the Machine
resource? I don't have a strong opinion here, I was trying to follow the specification as closely as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably not worth the effort until we see issues.
CAPZ code reference – I'm also not famliiar with it.
Regarding this PR, is our temporary owner reference deleted anywhere? Or do we have both our reference and CAPI's Machine
as owners? Once I grasp the concept here, I can maybe look through CAPZ or ask how it's done there, and why.
/remove-lifecycle stale |
cc16742
to
032ac8a
Compare
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
35b3503
to
2c63fbb
Compare
}, | ||
// Note: this AWSMachine will be owned by the MachinePool until the MachinePool controller | ||
// creates its parent Machine which will adopt this resource and replace the owner reference. | ||
// We set the MachinePool as a temporary owner to prevent this from becoming an orphan resource. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably not worth the effort until we see issues.
CAPZ code reference – I'm also not famliiar with it.
Regarding this PR, is our temporary owner reference deleted anywhere? Or do we have both our reference and CAPI's Machine
as owners? Once I grasp the concept here, I can maybe look through CAPZ or ask how it's done there, and why.
…anagedMachinePools
2c63fbb
to
4b0964f
Compare
As an update: I'll try to test this major feature. PR looks mostly fine, I think. I'm only unsure about the owner reference in the non-happy case. |
I got a working test environment, and the First blocking issue I found: deletion by downscaling the ASG/AWSMachinePool leads to this permanent error rather than deleting the
The
Other than that, the |
@AndiDog thanks for taking a look. I think that's slightly concerning bc there should be test cases covering that sort of condition. I suspect this PR has been left too long and the code has rotted. Indeed doesn't use ASGs anymore for autoscaling, so I can't really justify the time investment it would take to cut a new PR and start this fresh. |
What type of PR is this?
/kind feature
What this PR does / why we need it: Implements the MachinePool Machines clusterAPI proposal
Which issue(s) this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)
format, will close the issue(s) when PR gets merged):Fixes #4184
Special notes for your reviewer:
Checklist:
Release note: