✨ Add AWSMachines to back the ec2 instances in AWSMachinePools #4527

cnmcavoy · 2023-09-27T21:41:07Z

What type of PR is this?
/kind feature

What this PR does / why we need it: Implements the Cluster API MachinePool Machines proposal for AWSMachinePools.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #4184

Special notes for your reviewer:

Checklist:

squashed commits
includes documentation
adds unit tests
adds or updates e2e tests

Release note:

Added AWSMachines to back the ec2 instances in AWSMachinePools and AWSManagedMachinePools.

k8s-ci-robot · 2023-09-27T21:41:16Z

Hi @cnmcavoy. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

cnmcavoy · 2023-09-27T21:41:38Z

~~Still missing new tests to cover the functionality, have been testing locally with tilt~~

Edit: tests have been added.

Skarlso · 2023-09-28T05:10:54Z

/ok-to-test

Skarlso · 2023-10-13T04:29:23Z

/assign

exp/controllers/awsmachinepool_controller.go

Skarlso · 2023-10-16T11:53:10Z

exp/controllers/awsmachinepool_controller.go

+			machine, err := util.GetOwnerMachine(ctx, client, awsMachine.ObjectMeta)
+			if err != nil {
+				return fmt.Errorf("failed to get owner machine for %s/%s: %w", awsMachine.Namespace, awsMachine.Name, err)
+			}
+			log.V(2).Info("Deleting orphaned machine", "machine", machine, "awsmachine", awsMachine, "ProviderID", providerID)
+			if machine == nil {
+				// XXX(cmcavoy): if we got here, something went wrong with the owner reference
+				if err := client.Delete(ctx, &awsMachine); err != nil {
+					return fmt.Errorf("failed to delete orphan awsMachine %s/%s: %w", awsMachine.Namespace, awsMachine.Name, err)
+				}
+				continue
+			}
+
+			if err := client.Delete(ctx, machine); err != nil {
+				return fmt.Errorf("failed to delete orphan machine %s/%s: %w", machine.Namespace, machine.Name, err)
+			}


So if I understand this correctly...

If there is a machine for which there is no provider... check if there is an owner. If there is an owner, we delete the owner? And then we hope that cascading delete will also delete this machine?

cnmcavoy · 2023-10-17

That is how I parsed the proposal language:

> When a MachinePool Machine is deleted manually, the system will delete the corresponding provider-specific resource. The opposite is also true: when a provider-specific resource is deleted, the system will delete the corresponding MachinePool Machine. This happens by virtue of the infrastructureRef <-> ownerRef relationship.

https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20220209-machinepool-machines.md#proposal

@Jont828 @mboersma @devigned wrote the proposal and perhaps can provide clarity ...?

What is the expected outcome when the EC2 instance behind an AWSMachine and Machine pair owned by a MachinePool is destroyed by an external action (e.g. a user terminates the instance in the AWS console)? Should the infrastructure provider (CAPA) detect this termination and delete the CAPI Machine resource?

Jont828 · 2023-10-18

Yes, if an AWSMachine's associated provider instance is deleted, we expect the AWSMachinePool to delete the Machine and AWSMachine pair. This is so that we are not left with hanging resources in the event of an out-of-band instance deletion like you described.

Skarlso · 2023-10-18

Yes, but... if we delete its owner, is it okay that any OTHER machine this thing owns will also get deleted?

Jont828 · 2023-10-18

I'm not quite sure I understand. Are you saying that if you delete the CAPI Machine, i.e. the owner of the AWSMachine, you want to also delete another CAPI Machine associated with the same instance?

Skarlso · 2023-10-20

No, I mean is it a problem if the deleted owner owns more machines? Those would also be deleted then.

Jont828 · 2023-10-24

When you delete the CAPI Machine, we expect the CAPI Machine controller to delete the AWSMachine. It would be the same behavior we already have with MachineDeployments except there is no KubeadmConfig object to delete. Does that answer your question?
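The cleanup rule discussed in this thread can be sketched without any Kubernetes dependencies. The `Machine` and `AWSMachine` types below are simplified stand-ins, and `orphanDeletions` is a hypothetical helper, not code from the PR; it only mirrors the branch structure of the quoted diff. It assumes that each Machine owns a single AWSMachine, which is what the last reply implies.

```go
package main

import "fmt"

// Machine and AWSMachine are simplified stand-ins for the CAPI Machine and
// CAPA AWSMachine objects; only the fields needed for this sketch are kept.
type Machine struct{ Name string }

type AWSMachine struct {
	Name       string
	ProviderID string
	Owner      *Machine // nil when the owner reference could not be resolved
}

// orphanDeletions lists the objects to delete for every AWSMachine whose
// provider ID no longer appears among the live ASG instances. When an owner
// Machine exists, only the Machine is deleted and the AWSMachine is expected
// to follow through its owner reference; otherwise the AWSMachine is deleted
// directly.
func orphanDeletions(awsMachines []AWSMachine, liveProviderIDs map[string]bool) []string {
	var deletions []string
	for _, am := range awsMachines {
		if liveProviderIDs[am.ProviderID] {
			continue // instance still exists, nothing to clean up
		}
		if am.Owner == nil {
			deletions = append(deletions, "AWSMachine/"+am.Name)
			continue
		}
		deletions = append(deletions, "Machine/"+am.Owner.Name)
	}
	return deletions
}

func main() {
	machines := []AWSMachine{
		{Name: "a", ProviderID: "aws:///i-1", Owner: &Machine{Name: "m-a"}},
		{Name: "b", ProviderID: "aws:///i-2", Owner: &Machine{Name: "m-b"}},
		{Name: "c", ProviderID: "aws:///i-3"},
	}
	live := map[string]bool{"aws:///i-1": true}
	fmt.Println(orphanDeletions(machines, live))
}
```

Under the one-Machine-per-AWSMachine assumption, deleting the owner never removes an unrelated machine, which is the concern raised above.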

pkg/cloud/scope/machine.go

Ankitasw · 2024-02-01T06:18:18Z

@cnmcavoy is there a final review pending for this PR, or any other items?

cnmcavoy · 2024-02-01T16:27:18Z

@Ankitasw I think all the questions have been answered, it probably needs a new reviewer to give it a look over for a lgtm. It's ready to be merged if there are no more review items to address.

AndiDog · 2024-04-16T19:12:43Z

I can have a look into this, @cnmcavoy – do you want to resolve conflicts first?

cnmcavoy · 2024-04-16T21:25:27Z

> I can have a look into this, @cnmcavoy – do you want to resolve conflicts first?

Sure, I think I got them all sorted.

k8s-triage-robot · 2024-07-15T22:25:33Z

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle stale
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

AndiDog

I reviewed most of the PR except the test and managed machine pool code. Sorry, I have quite a few questions and comments 😉.

Overall, this looks pretty good. Did you successfully test this already?

AndiDog · 2024-07-11T16:31:44Z

controllers/awsmachine_controller.go

+		err = errors.New("no instance found for machine pool")
+		machineScope.Error(err, "unable to find instance")
+		conditions.MarkUnknown(machineScope.AWSMachine, infrav1.InstanceReadyCondition, infrav1.InstanceNotFoundReason, err.Error())


When you tested the new feature, did this code block spam the logs with lots of errors? I can imagine that a missing EC2 instance can happen a lot if the ASG terminates instances. Or does this not happen because CAPA watches for deletion events?

Wondering because findInstance has no special logic for a "not found" error which would allow us to swallow those.

And shouldn't we delete the Machine (and thereby InfraMachine) in this case (EC2 instance went away, so it shouldn't be represented as object anymore)? Otherwise the object may dangle indefinitely.

(TODO / note to self: the AWSMachinePool reconciler may do that, check in my code review)

cnmcavoy · 2024-07-17

I do not recall being spammed with errors, but I last tested this code in a cluster 6 months ago.

exp/api/v1beta2/conditions_consts.go

exp/controllers/awsmachinepool_controller.go

AndiDog · 2024-07-16T08:05:21Z

exp/controllers/awsmachinepool_controller.go

+				},
+				// Note: this AWSMachine will be owned by the MachinePool until the MachinePool controller
+				// creates its parent Machine which will adopt this resource and replace the owner reference.
+				// We set the MachinePool as a temporary owner to prevent this from becoming an orphan resource.


Do we need to set BlockOwnerDeletion: true in the owner reference?

In CAPZ, that is done, and they're using AzureMachinePool, not MachinePool as the owner. I'm wondering which is better. Having CAPI's MachinePool controller own an AWS* resource seems wrong at first sight, since it won't delete the referenced object.

cnmcavoy · 2024-07-17

Re: BlockOwnerDeletion, I am not sure. I didn't set it because this is supposed to be temporary ownership.

I'm not familiar with the CAPZ implementation. Do they have AzureMachinePool act as a long-lived owner, or does the ownership get passed to the Machine resource? I don't have a strong opinion here; I was trying to follow the specification as closely as possible.

AndiDog · 2024-07-26T13:11:52Z

Probably not worth the effort until we see issues.

CAPZ code reference – I'm also not familiar with it.

Regarding this PR, is our temporary owner reference deleted anywhere? Or do we have both our reference and CAPI's Machine as owners? Once I grasp the concept here, I can maybe look through CAPZ or ask how it's done there, and why.
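To make the open question concrete, here is a sketch of what "adopt this resource and replace the owner reference" would mean if the temporary reference is dropped rather than kept alongside the new one. `OwnerRef` mirrors only a few fields of `metav1.OwnerReference`, and `adopt` is a hypothetical helper; whether CAPI's MachinePool controller actually behaves this way is not confirmed in this thread.

```go
package main

import "fmt"

// OwnerRef mirrors the few metav1.OwnerReference fields this sketch needs.
type OwnerRef struct {
	Kind       string
	Name       string
	Controller bool
}

// adopt returns a copy of refs in which any temporary MachinePool owner is
// removed and the named Machine is present as the controlling owner. Calling
// it again with the same Machine leaves the result unchanged.
func adopt(refs []OwnerRef, machineName string) []OwnerRef {
	out := make([]OwnerRef, 0, len(refs)+1)
	hasMachine := false
	for _, r := range refs {
		if r.Kind == "MachinePool" {
			continue // drop the temporary owner
		}
		if r.Kind == "Machine" && r.Name == machineName {
			hasMachine = true
		}
		out = append(out, r)
	}
	if !hasMachine {
		out = append(out, OwnerRef{Kind: "Machine", Name: machineName, Controller: true})
	}
	return out
}

func main() {
	refs := []OwnerRef{{Kind: "MachinePool", Name: "pool-0"}}
	refs = adopt(refs, "machine-0")
	fmt.Printf("%+v\n", refs)
}
```

If the real behavior instead appends the Machine without removing the MachinePool, the AWSMachine would end up with two owners, which is the non-happy case asked about above.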

exp/controllers/awsmachinepool_controller.go

AndiDog · 2024-07-16T08:17:01Z

/remove-lifecycle stale

k8s-ci-robot · 2024-07-22T19:18:08Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign richardcase for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

AndiDog · 2024-08-13T09:32:48Z

As an update: I'll try to test this major feature. PR looks mostly fine, I think. I'm only unsure about the owner reference in the non-happy case.

AndiDog · 2024-08-22T16:06:24Z

I got a working test environment, and the AWSMachine/Machine objects get created correctly 🍀.

First blocking issue I found: deletion by downscaling the ASG/AWSMachinePool leads to this permanent error rather than deleting the AWSMachine:

  failureMessage: EC2 instance state "terminated" is unexpected
  failureReason: UpdateError
  instanceState: terminated
  ready: false

The Machine/AWSMachine combo also can't be deleted due to:

capa_control… │ I0822 16:04:49.507270       1 awsmachine_controller.go:303] "Handling deleted AWSMachine"
capa_control… │ E0822 16:04:49.508449       1 awsmachine_controller.go:308] "unable to delete machine" err="failed to get raw userdata: error retrieving bootstrap data: linked Machine's bootstrap.dataSecretName is nil"

Other than that, the Node object went away and the cluster reacted according to the node deletion. @cnmcavoy what happened when you tested this? And which other cases do you think should be tested? I guess cluster-autoscaler scaling of ASGs may be a good external trigger that CAPA should have no problem with – I can test that.
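The first blocking issue comes down to how an observed instance state maps to a reconcile action. The sketch below shows one possible mapping in which a terminated instance behind a pool-owned AWSMachine leads to cleanup instead of a permanent failure. The `poolOwned` flag, the `action` type, and the mapping itself are assumptions for illustration, not CAPA's behavior or the fix that was eventually chosen.

```go
package main

import "fmt"

// action is what this sketch's reconciler would do next.
type action string

const (
	actionKeep   action = "keep"
	actionFail   action = "set-failure"
	actionDelete action = "delete-machine"
)

// nextAction maps an observed EC2 instance state to a reconcile action.
// poolOwned marks AWSMachines that belong to an AWSMachinePool.
func nextAction(instanceState string, poolOwned bool) action {
	switch instanceState {
	case "pending", "running":
		return actionKeep
	case "shutting-down", "terminated":
		if poolOwned {
			// The ASG removed the instance (for example on scale-down), so
			// clean up the Machine/AWSMachine pair instead of failing.
			return actionDelete
		}
		return actionFail
	default:
		// Simplification: every other state, such as "stopped", is treated
		// as a failure in this sketch.
		return actionFail
	}
}

func main() {
	fmt.Println(nextAction("terminated", true))
	fmt.Println(nextAction("terminated", false))
	fmt.Println(nextAction("running", true))
}
```

The second error in the report (a nil `bootstrap.dataSecretName` on the delete path) is a separate problem that this mapping does not address.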

cnmcavoy · 2024-08-27T21:52:47Z

@AndiDog thanks for taking a look. I think that's slightly concerning bc there should be test cases covering that sort of condition.

I suspect this PR has been left too long and the code has rotted. Indeed doesn't use ASGs anymore for autoscaling, so I can't really justify the time investment it would take to cut a new PR and start this fresh.

k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 27, 2023

k8s-ci-robot requested review from Ankitasw and richardcase September 27, 2023 21:41

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 28, 2023

cnmcavoy force-pushed the cnmcavoy/awsmachinepool-awsmachines branch 3 times, most recently from 6578b66 to 16514f6 Compare October 12, 2023 18:39

k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 12, 2023

cnmcavoy changed the title ~~WIP: Add AWSMachines to back the ec2 instances in AWSMachinePools~~ Add AWSMachines to back the ec2 instances in AWSMachinePools Oct 12, 2023

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 12, 2023

k8s-ci-robot assigned Skarlso Oct 13, 2023

Skarlso reviewed Oct 16, 2023 (five review comments, since marked outdated: four on exp/controllers/awsmachinepool_controller.go, one on pkg/cloud/scope/machine.go)

cnmcavoy force-pushed the cnmcavoy/awsmachinepool-awsmachines branch from 16514f6 to f51d2ab Compare October 17, 2023 21:51

cnmcavoy force-pushed the cnmcavoy/awsmachinepool-awsmachines branch from 7cc41eb to cbfd8e2 Compare January 26, 2024 18:57

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 26, 2024

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 5, 2024

cnmcavoy force-pushed the cnmcavoy/awsmachinepool-awsmachines branch from cbfd8e2 to 4c6e311 Compare April 16, 2024 20:49

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 16, 2024

cnmcavoy force-pushed the cnmcavoy/awsmachinepool-awsmachines branch 2 times, most recently from 6632e0f to cc16742 Compare April 16, 2024 21:18

AndiDog mentioned this pull request Jul 15, 2024

✨feat(awsmachinepool): custom lifecyclehooks for machinepools #4875

Open

5 tasks

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 15, 2024

AndiDog reviewed Jul 16, 2024

k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 16, 2024

cnmcavoy force-pushed the cnmcavoy/awsmachinepool-awsmachines branch from cc16742 to 032ac8a Compare July 22, 2024 19:18

cnmcavoy force-pushed the cnmcavoy/awsmachinepool-awsmachines branch 2 times, most recently from 35b3503 to 2c63fbb Compare July 22, 2024 19:37

cnmcavoy requested a review from AndiDog July 22, 2024 20:44

AndiDog reviewed Jul 26, 2024

Add AWSMachines to back the ec2 instances in AWSMachinePools and AWSM…

4b0964f

…anagedMachinePools

cnmcavoy force-pushed the cnmcavoy/awsmachinepool-awsmachines branch from 2c63fbb to 4b0964f Compare July 26, 2024 19:35

cnmcavoy closed this Aug 27, 2024

AndiDog mentioned this pull request Oct 22, 2024

✨ Add AWSMachines to back the EC2 instances in AWSMachinePools and AWSManagedMachinePools #5174

Open

5 tasks
