✨ Add AWSMachines to back the ec2 instances in AWSMachinePools #4527

Contributor

@cnmcavoy cnmcavoy commented Sep 27, 2023

What type of PR is this?
/kind feature

What this PR does / why we need it: Implements the MachinePool Machines Cluster API proposal.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #4184

Special notes for your reviewer:

Checklist:

  • squashed commits
  • includes documentation
  • adds unit tests
  • adds or updates e2e tests

Release note:

Added AWSMachines to back the ec2 instances in AWSMachinePools and AWSManagedMachinePools.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 27, 2023
@k8s-ci-robot
Contributor

Hi @cnmcavoy. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 27, 2023
@cnmcavoy
Contributor Author

cnmcavoy commented Sep 27, 2023

Still missing new tests to cover the functionality; I have been testing locally with Tilt.

Edit: tests have been added.

@Skarlso
Contributor

Skarlso commented Sep 28, 2023

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 28, 2023
@cnmcavoy cnmcavoy force-pushed the cnmcavoy/awsmachinepool-awsmachines branch 3 times, most recently from 6578b66 to 16514f6 Compare October 12, 2023 18:39
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 12, 2023
@cnmcavoy cnmcavoy changed the title WIP: Add AWSMachines to back the ec2 instances in AWSMachinePools Add AWSMachines to back the ec2 instances in AWSMachinePools Oct 12, 2023
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 12, 2023
@Skarlso
Contributor

Skarlso commented Oct 13, 2023

/assign

Comment on lines 564 to 591
machine, err := util.GetOwnerMachine(ctx, client, awsMachine.ObjectMeta)
if err != nil {
	return fmt.Errorf("failed to get owner machine for %s/%s: %w", awsMachine.Namespace, awsMachine.Name, err)
}
log.V(2).Info("Deleting orphaned machine", "machine", machine, "awsmachine", awsMachine, "ProviderID", providerID)
if machine == nil {
	// XXX(cmcavoy): if we got here, something went wrong with the owner reference
	if err := client.Delete(ctx, &awsMachine); err != nil {
		return fmt.Errorf("failed to delete orphan awsMachine %s/%s: %w", awsMachine.Namespace, awsMachine.Name, err)
	}
	continue
}

if err := client.Delete(ctx, machine); err != nil {
	return fmt.Errorf("failed to delete orphan machine %s/%s: %w", machine.Namespace, machine.Name, err)
}
Contributor

So if I understand this correctly...

If there is a machine for which there is no provider... check if there is an owner. If there is an owner, we delete the owner? And then we hope that cascading delete will also delete this machine?

Contributor Author

That is how I parsed the proposal language:

When a MachinePool Machine is deleted manually, the system will delete the corresponding provider-specific resource. The opposite is also true: when a provider-specific resource is deleted, the system will delete the corresponding MachinePool Machine. This happens by virtue of the infrastructureRef <-> ownerRef relationship.

https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20220209-machinepool-machines.md#proposal

@Jont828 @mboersma @devigned wrote the proposal and perhaps can provide clarity ...?

What is the expected outcome for an AWSMachine and Machine pair owned by a MachinePool when its EC2 instance is destroyed by an external action (e.g. a user terminates the instance in the AWS console)? Should the infrastructure provider (CAPA) detect this termination and delete the CAPI Machine resource?

Copy link

Yes, if an AWSMachine's associated provider instance is deleted, we expect the AWSMachinePool to delete the Machine and AWSMachine pair. This is so that we are not left with hanging resources in the event of an out-of-band instance deletion like you described.

Contributor

Yes, but... if we delete its owner, is it okay that any OTHER machine this thing owns will also get deleted?

Copy link

I'm not quite sure I understand. Are you saying that if you delete the CAPI Machine, i.e. owner of the AWS Machine, you want to also delete another CAPI Machine associated with the same instance?

Contributor

No, I mean is it a problem if the deleted owner owns more machines? Those would also be deleted then.

@Jont828 Jont828 Oct 24, 2023

When you delete the CAPI Machine, we expect the CAPI Machine controller to delete the AWSMachine. It would be the same behavior we already have with MachineDeployments except there is no KubeadmConfig object to delete. Does that answer your question?
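The deletion rule discussed in this thread can be sketched in isolation. This is a minimal illustration, not the PR's code: `poolMachine`, `deletion`, and `planOrphanDeletions` are hypothetical names, and the sketch assumes each CAPI Machine owns exactly one AWSMachine, so deleting the owner does not touch any other pool member.

```go
package main

import "fmt"

// poolMachine is a hypothetical, simplified stand-in for an AWSMachine in a
// machine pool; it is not the CAPA API type.
type poolMachine struct {
	Name         string
	ProviderID   string // provider ID of the backing EC2 instance
	OwnerMachine string // name of the owning CAPI Machine, "" if none was found
}

// deletion names the single object to delete for one orphaned pool machine.
type deletion struct {
	Kind string
	Name string
}

// planOrphanDeletions returns one deletion per machine whose instance is no
// longer reported as live. If an owning Machine exists, that Machine is the
// target and the Machine controller is expected to clean up the AWSMachine;
// otherwise the AWSMachine is deleted directly.
func planOrphanDeletions(machines []poolMachine, liveProviderIDs map[string]bool) []deletion {
	var plan []deletion
	for _, m := range machines {
		if liveProviderIDs[m.ProviderID] {
			continue // instance still exists, nothing to do
		}
		if m.OwnerMachine != "" {
			plan = append(plan, deletion{Kind: "Machine", Name: m.OwnerMachine})
			continue
		}
		plan = append(plan, deletion{Kind: "AWSMachine", Name: m.Name})
	}
	return plan
}

func main() {
	machines := []poolMachine{
		{Name: "pool-0", ProviderID: "aws:///us-east-1a/i-001", OwnerMachine: "machine-0"},
		{Name: "pool-1", ProviderID: "aws:///us-east-1a/i-002", OwnerMachine: "machine-1"},
		{Name: "pool-2", ProviderID: "aws:///us-east-1a/i-003"},
	}
	live := map[string]bool{"aws:///us-east-1a/i-001": true}
	for _, d := range planOrphanDeletions(machines, live) {
		fmt.Printf("delete %s %s\n", d.Kind, d.Name)
	}
}
```

Under that one-to-one assumption, each orphan produces exactly one deletion, which is why removing the owning Machine would not cascade to other machines in the pool.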

@cnmcavoy cnmcavoy force-pushed the cnmcavoy/awsmachinepool-awsmachines branch from 16514f6 to f51d2ab Compare October 17, 2023 21:51
@cnmcavoy cnmcavoy force-pushed the cnmcavoy/awsmachinepool-awsmachines branch from 7cc41eb to cbfd8e2 Compare January 26, 2024 18:57
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 26, 2024
@Ankitasw
Member

Ankitasw commented Feb 1, 2024

@cnmcavoy is there a final review pending for this PR, or any other items?

@cnmcavoy
Contributor Author

cnmcavoy commented Feb 1, 2024

@Ankitasw I think all the questions have been answered. It probably needs a new reviewer to look it over for an lgtm. It's ready to be merged if there are no more review items to address.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 5, 2024
@AndiDog
Contributor

AndiDog commented Apr 16, 2024

I can have a look into this, @cnmcavoy – do you want to resolve conflicts first?

@cnmcavoy cnmcavoy force-pushed the cnmcavoy/awsmachinepool-awsmachines branch from cbfd8e2 to 4c6e311 Compare April 16, 2024 20:49
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 16, 2024
@cnmcavoy cnmcavoy force-pushed the cnmcavoy/awsmachinepool-awsmachines branch 2 times, most recently from 6632e0f to cc16742 Compare April 16, 2024 21:18
@cnmcavoy
Contributor Author

> I can have a look into this, @cnmcavoy – do you want to resolve conflicts first?

Sure, I think I got them all sorted.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 15, 2024
Contributor

@AndiDog AndiDog left a comment

I reviewed most of the PR except the test and managed machine pool code. Sorry, I have quite a few questions and comments 😉.

Overall, this looks pretty good. Did you successfully test this already?

Comment on lines +494 to +505
err = errors.New("no instance found for machine pool")
machineScope.Error(err, "unable to find instance")
conditions.MarkUnknown(machineScope.AWSMachine, infrav1.InstanceReadyCondition, infrav1.InstanceNotFoundReason, err.Error())
Contributor

When you tested the new feature, did this code block spam the logs with lots of errors? I can imagine that a missing EC2 instance can happen a lot if the ASG terminates instances. Or does this not happen because CAPA watches for deletion events?

Wondering because findInstance has no special logic for a "not found" error which would allow us to swallow those.

And shouldn't we delete the Machine (and thereby InfraMachine) in this case (EC2 instance went away, so it shouldn't be represented as object anymore)? Otherwise the object may dangle indefinitely.

(TODO / note to self: the AWSMachinePool reconciler may do that, check in my code review)

Contributor Author

I do not recall being spammed with errors, but I last tested this code in a cluster 6 months ago.
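One way to give a lookup the "not found" special case asked about above is a sentinel error checked with `errors.Is`. The sketch below is an assumption-laden illustration: `errInstanceNotFound`, `lookupInstance`, and `classify` are hypothetical names, and the PR's `findInstance` is not claimed to behave this way.

```go
package main

import (
	"errors"
	"fmt"
)

// errInstanceNotFound is a hypothetical sentinel error; it is not defined by
// the PR or by CAPA.
var errInstanceNotFound = errors.New("instance not found")

// lookupInstance stands in for an EC2 lookup. The map plays the role of the
// cloud API: instance ID -> state.
func lookupInstance(known map[string]string, id string) (string, error) {
	state, ok := known[id]
	if !ok {
		// Wrap the sentinel so callers can detect it with errors.Is.
		return "", fmt.Errorf("looking up %q: %w", id, errInstanceNotFound)
	}
	return state, nil
}

// classify decides what a reconciler could do with the lookup result:
// continue normally, clean up quietly, or surface a real error.
func classify(err error) string {
	switch {
	case err == nil:
		return "continue"
	case errors.Is(err, errInstanceNotFound):
		return "cleanup" // expected when the ASG terminated the instance
	default:
		return "error"
	}
}

func main() {
	known := map[string]string{"i-001": "running"}
	for _, id := range []string{"i-001", "i-002"} {
		_, err := lookupInstance(known, id)
		fmt.Printf("%s: %s\n", id, classify(err))
	}
}
```

Separating the expected "gone" case from genuine failures would let the reconciler skip error-level logging for instances the ASG removed on purpose.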

},
// Note: this AWSMachine will be owned by the MachinePool until the MachinePool controller
// creates its parent Machine which will adopt this resource and replace the owner reference.
// We set the MachinePool as a temporary owner to prevent this from becoming an orphan resource.
Contributor

Do we need to set BlockOwnerDeletion: true in the owner reference?

In CAPZ, that is done, and they're using AzureMachinePool, not MachinePool as the owner. I'm wondering which is better. Having CAPI's MachinePool controller own an AWS* resource seems wrong at first sight, since it won't delete the referenced object.

Contributor Author

Re: BlockOwnerDeletion, I am not sure. I didn't set it because this is supposed to be temporary ownership.

I'm not familiar with the CAPZ implementation. Do they have AzureMachinePool act as a long-lived owner, or does the ownership get passed to the Machine resource? I don't have a strong opinion here, I was trying to follow the specification as closely as possible.

Contributor

Probably not worth the effort until we see issues.

CAPZ code reference – I'm also not familiar with it.

Regarding this PR, is our temporary owner reference deleted anywhere? Or do we have both our reference and CAPI's Machine as owners? Once I grasp the concept here, I can maybe look through CAPZ or ask how it's done there, and why.
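To make the question about the temporary owner concrete, here is a rough sketch of what "replace the owner reference" could look like. It is not taken from the PR or from CAPI: `ownerRef` is a local, cut-down stand-in for the Kubernetes owner reference type, and `handOff` is a hypothetical helper. Whether the real adoption removes the MachinePool reference or leaves both owners in place is exactly what the thread leaves open.

```go
package main

import "fmt"

// ownerRef is a local, simplified stand-in for a Kubernetes owner reference.
// Only the fields needed for this illustration are included.
type ownerRef struct {
	APIVersion         string
	Kind               string
	Name               string
	BlockOwnerDeletion bool
}

// handOff drops every reference of the temporary kind and makes sure the
// adopting owner is present exactly once.
func handOff(refs []ownerRef, tempKind string, adopter ownerRef) []ownerRef {
	out := make([]ownerRef, 0, len(refs)+1)
	adopted := false
	for _, r := range refs {
		if r.Kind == tempKind {
			continue // remove the temporary owner
		}
		if r.Kind == adopter.Kind && r.Name == adopter.Name {
			adopted = true // already adopted, do not add a duplicate
		}
		out = append(out, r)
	}
	if !adopted {
		out = append(out, adopter)
	}
	return out
}

func main() {
	// Illustrative values only.
	refs := []ownerRef{
		{APIVersion: "cluster.x-k8s.io/v1beta1", Kind: "MachinePool", Name: "pool"},
	}
	machine := ownerRef{
		APIVersion:         "cluster.x-k8s.io/v1beta1",
		Kind:               "Machine",
		Name:               "pool-machine-0",
		BlockOwnerDeletion: true,
	}
	for _, r := range handOff(refs, "MachinePool", machine) {
		fmt.Printf("%s/%s blockOwnerDeletion=%t\n", r.Kind, r.Name, r.BlockOwnerDeletion)
	}
}
```

If both references were kept instead, garbage collection would only remove the AWSMachine once every owner is gone, which is the non-happy case worth checking.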

@AndiDog
Contributor

AndiDog commented Jul 16, 2024

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 16, 2024
@cnmcavoy cnmcavoy force-pushed the cnmcavoy/awsmachinepool-awsmachines branch from cc16742 to 032ac8a Compare July 22, 2024 19:18
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign richardcase for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@cnmcavoy cnmcavoy force-pushed the cnmcavoy/awsmachinepool-awsmachines branch 2 times, most recently from 35b3503 to 2c63fbb Compare July 22, 2024 19:37
@cnmcavoy cnmcavoy requested a review from AndiDog July 22, 2024 20:44
@cnmcavoy cnmcavoy force-pushed the cnmcavoy/awsmachinepool-awsmachines branch from 2c63fbb to 4b0964f Compare July 26, 2024 19:35
@AndiDog
Contributor

AndiDog commented Aug 13, 2024

As an update: I'll try to test this major feature. PR looks mostly fine, I think. I'm only unsure about the owner reference in the non-happy case.

@AndiDog
Contributor

AndiDog commented Aug 22, 2024

I got a working test environment, and the AWSMachine/Machine objects get created correctly 🍀.

First blocking issue I found: deletion by downscaling the ASG/AWSMachinePool leads to this permanent error rather than deleting the AWSMachine:

  failureMessage: EC2 instance state "terminated" is unexpected
  failureReason: UpdateError
  instanceState: terminated
  ready: false

The Machine/AWSMachine combo also can't be deleted due to:

capa_control… │ I0822 16:04:49.507270       1 awsmachine_controller.go:303] "Handling deleted AWSMachine"
capa_control… │ E0822 16:04:49.508449       1 awsmachine_controller.go:308] "unable to delete machine" err="failed to get raw userdata: error retrieving bootstrap data: linked Machine's bootstrap.dataSecretName is nil"

Other than that, the Node object went away and the cluster reacted according to the node deletion. @cnmcavoy what happened when you tested this? And which other cases do you think should be tested? I guess cluster-autoscaler scaling of ASGs may be a good external trigger that CAPA should have no problem with – I can test that.
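One possible direction for the blocking issue above, sketched under assumptions: treat a terminated instance as a deletion trigger when the machine belongs to a pool, and keep the failure behavior otherwise. `action` and `nextAction` are hypothetical names; the state strings are the standard EC2 instance state names. This is not what the PR implements.

```go
package main

import "fmt"

// action is a hypothetical reconcile decision; it is not a CAPA type.
type action string

const (
	actionWait   action = "wait"
	actionReady  action = "ready"
	actionDelete action = "delete"
	actionFail   action = "fail"
)

// nextAction maps an EC2 instance state to a decision. For pool-owned
// machines, an instance that is going away is treated as a signal to delete
// the Machine/AWSMachine pair rather than as a permanent failure.
func nextAction(state string, poolOwned bool) action {
	switch state {
	case "pending":
		return actionWait
	case "running":
		return actionReady
	case "shutting-down", "terminated":
		if poolOwned {
			return actionDelete // the ASG scaled down or replaced the instance
		}
		return actionFail // a standalone machine is not expected to vanish
	default:
		return actionFail // any other state is treated as unexpected in this sketch
	}
}

func main() {
	for _, poolOwned := range []bool{true, false} {
		fmt.Printf("terminated, poolOwned=%t -> %s\n", poolOwned, nextAction("terminated", poolOwned))
	}
}
```

The deletion path would also need to avoid requiring bootstrap user data, since the log above shows the delete failing on a nil `bootstrap.dataSecretName`.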

@cnmcavoy
Contributor Author

@AndiDog thanks for taking a look. I think that's slightly concerning because there should be test cases covering that sort of condition.

I suspect this PR has been left too long and the code has rotted. Indeed doesn't use ASGs anymore for autoscaling, so I can't really justify the time investment it would take to cut a new PR and start this fresh.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. needs-priority ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Implement MachinePool Machines clusterAPI proposal
7 participants