Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-write parts of AEP-4016 (VPA with In-Place Updates) to align with the new In-Place Updates KEP changes #7813

Merged
merged 14 commits into from
Feb 21, 2025

Conversation

raywainman
Copy link
Contributor

What type of PR is this?

/kind documentation
/kind feature

What this PR does / why we need it:

Many changes have happened to the In-Place Updates KEP since 2023, we need to refresh AEP-4016 to align with the new changes so that we can proceed with implementation.

Which issue(s) this PR fixes:

Part of #7673

Does this PR introduce a user-facing change?

NONE - internal documentation only

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

All information linked in the AEP.

@k8s-ci-robot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/documentation Categorizes issue or PR as related to documentation. kind/feature Categorizes issue or PR as related to a new feature. area/vertical-pod-autoscaler labels Feb 7, 2025
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 7, 2025
@raywainman raywainman marked this pull request as ready for review February 7, 2025 16:32
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 7, 2025
@raywainman
Copy link
Contributor Author

/assign maxcao13

@k8s-ci-robot
Copy link
Contributor

@raywainman: GitHub didn't allow me to assign the following users: maxcao13.

Note that only kubernetes members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign maxcao13

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@raywainman
Copy link
Contributor Author

cc @maxcao13

@raywainman
Copy link
Contributor Author

@maxcao13 We can consider cutting the "disruption-less" updates as part of our initial MVP. This essentially means that we have the current Auto mode + in-place updates.

Would love your thoughts :)

@raywainman
Copy link
Contributor Author

/assign @laoj2

@k8s-ci-robot
Copy link
Contributor

@raywainman: GitHub didn't allow me to assign the following users: laoj2.

Note that only kubernetes members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @laoj2

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@raywainman
Copy link
Contributor Author

cc @jpawelczak
cc @laoj2

Copy link
Contributor

@maxcao13 maxcao13 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for writing this up! Some comments and questions.

Comment on lines 136 to 138
In the `InPlaceOrRecreate` modes VPA updater will attempt to apply updates that require container
restart in place. It will update them under the same conditions that would trigger update with
`Recreate` mode. That is it will apply disruptive updates if:
Copy link
Contributor

@maxcao13 maxcao13 Feb 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say the updater should only attempt to apply these "disruptive" updates if the container restart resize policy is RestartContainer. VPA shouldn't actively try to do this otherwise. At least that's what I understood from the previous AEP description.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean: if the container restart resize policy is RestartContainer, instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! My mistake, I've edited the original comment,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed on Slack, I don't think VPA needs to care about the resize policy, I think it should simply attempt the resize operation if the user set it that way and let the rest of the Kubernetes machinery take care of actually applying the update.

Does that sound OK?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a small section about the interaction between VPA and the ResizePolicy setting, hopefully that clarifies things a bit?

Comment on lines 136 to 138
In the `InPlaceOrRecreate` modes VPA updater will attempt to apply updates that require container
restart in place. It will update them under the same conditions that would trigger update with
`Recreate` mode. That is it will apply disruptive updates if:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean: if the container restart resize policy is RestartContainer, instead?

### 1. Applying Updates During Pod Admission

For VPAs using the new `InPlaceOrRecreate` mode, the VPA Admission Controller will apply updates to
starting pods just as it does for VPAs in `Initial`, `Auto`, and `Recreate` modes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not entirely sure if this AEP is the correct place to make this comment, but the promise of Auto was that it would one day support in-place, see https://github.com/kubernetes/autoscaler/blob/78c8173b979316f892327022d53369760b000210/vertical-pod-autoscaler/docs/api.md#updatemode

We've been promising users that if they use Auto today, one day in the future the VPA will start resizing their pods for them without recreating them.
It seems that this may not be the case since the container runtime can't guarantee it.

Do we need to start thinking about what to do with Auto in the future?

Again, I'm unsure if this really fits into the context of this AEP, but I thought I'd bring it up.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My 2 cents on this are that, switching Auto to be InPlace or some variation of InPlace would be a breaking change that should be in something like VPA 2.0. It may not "break" someone's configuration, but the change probably won't be the same expected behaviour in VPA 0.x and 1.x when the updater was freely evicting pods in Auto mode. Although, what I'm talking about is probably a discussion for the future timeline of VPA in general.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing the behavior of Auto, good idea. Major version bump to accompany that, also a good idea.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the discussion around changing the behavior of auto, we're talking about a future where in-place vertical scaling is GA and not under a feature gate, right? IMHO it would feel a bit weird to make the default behavior of VPA depend on a beta feature gate.

Copy link
Member

@adrianmoisey adrianmoisey Feb 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the discussion around changing the behavior of auto, we're talking about a future where in-place vertical scaling is GA and not under a feature gate, right?

Correct. Since the work on in-place is starting to happen now, and we are potentially diverging on the promise given to users, it seems appropriate to discuss Auto now, just so it's considered in our discussions as we move forward with the in-place rollout.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specified in the AEP that the intention is that this eventually graduates into the Auto mode.

@raywainman
Copy link
Contributor Author

Thanks all! I've applied changes based on the feedback and some discussions we've on Slack.

Would love another look when folks have some time, thanks again!

Copy link
Contributor

@voelzmo voelzmo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Not instantly approving to give others a chance for a second look.

@adrianmoisey
Copy link
Member

adrianmoisey commented Feb 20, 2025

/hold

EDIT: Oh, hold is already on this PR

Copy link
Member

@omerap12 omerap12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Thanks @raywainman!

* If `CanEvict` is false.
* If any of the `EvictionRequirements` on the VPA are not true.

These additional resizes can be attempted because the eviction fallback would fail anyway.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just want to clarify this...
So an in-place update will be attempted, if that in-place fails, the VPA will not attempt an eviction?
Almost as if it was InPlaceOrNotAtAll ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will attempt an eviction but the eviction won't go through because the actual condition will prevent it from happening. Let me clarify that in the text.

Comment on lines 155 to 158
These additional resizes can be attempted because the eviction fallback would fail anyway.

The VPA updater will evict a pod to actuate a recommendation if it attempted to apply the
recommendation in place and failed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two sentences seem to conflict with each other

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarified.

The idea here is that we can be slightly more relaxed with in-place resizes and still allow VPA to proceed with eviction when the operation fails. The eviction would get blocked anyway since these conditions themselves prevent eviction from happening. Hopefully that makes sense.

@adrianmoisey
Copy link
Member

The content /lgtm

I left some mostly minor comments, and one clarification on wording

@laoj2
Copy link
Contributor

laoj2 commented Feb 20, 2025

/lgtm

Looks great, thanks @raywainman!

@k8s-ci-robot
Copy link
Contributor

@laoj2: changing LGTM is restricted to collaborators

In response to this:

/lgtm

Looks great, thanks @raywainman!

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Copy link
Contributor

@maxcao13 maxcao13 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great and clears up much of my concerns with the previous iteration. Thanks so much @raywainman. I think we'll be able to close #7722 from this too.

@raywainman
Copy link
Contributor Author

cc @tallclair

@adrianmoisey
Copy link
Member

/lgtm

Thank you!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 21, 2025
@omerap12
Copy link
Member

/lgtm for me too

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: omerap12, raywainman, voelzmo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@raywainman
Copy link
Contributor Author

Thanks all! Will merge this now. We can make further updates in new PRs as needed.

@raywainman
Copy link
Contributor Author

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 21, 2025
@k8s-ci-robot k8s-ci-robot merged commit 0a34bf5 into kubernetes:master Feb 21, 2025
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/vertical-pod-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/documentation Categorizes issue or PR as related to documentation. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants