Make maxNodeStartupTime configurable #8543
Conversation
Welcome @lxuan94-pp!
Hi @lxuan94-pp. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: lxuan94-pp. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/ok-to-test
this seems like a reasonable flag to propose, i'd like a little time to study the clusterapi specific changes.
/test pull-cluster-autoscaler-e2e-azure-master
Thanks @elmiko and @jackfrancis for the reviews, please let me know if you have further concerns.
```go
func (p *DelegatingNodeGroupConfigProcessor) GetMaxNodeStartupTime(nodeGroup cloudprovider.NodeGroup) (time.Duration, error) {
	ngConfig, err := nodeGroup.GetOptions(p.nodeGroupDefaults)
	if err != nil && err != cloudprovider.ErrNotImplemented {
		return 15 * time.Minute, err
```
I think we want to return p.nodeGroupDefaults.MaxNodeStartupTime
here as well
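For reference, the method after the suggested fix might look like this. This is a sketch: the nil-config and ErrNotImplemented branches are assumed from the pattern of the sibling GetMaxNodeProvisionTime delegating method, not copied from the diff.

```go
func (p *DelegatingNodeGroupConfigProcessor) GetMaxNodeStartupTime(nodeGroup cloudprovider.NodeGroup) (time.Duration, error) {
	ngConfig, err := nodeGroup.GetOptions(p.nodeGroupDefaults)
	if err != nil && err != cloudprovider.ErrNotImplemented {
		// On a real error, fall back to the configured default
		// rather than a hard-coded 15 minutes.
		return p.nodeGroupDefaults.MaxNodeStartupTime, err
	}
	if ngConfig == nil || err == cloudprovider.ErrNotImplemented {
		// Provider has no per-node-group options: use the global default.
		return p.nodeGroupDefaults.MaxNodeStartupTime, nil
	}
	return ngConfig.MaxNodeStartupTime, nil
}
```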
Thanks, updated.
```diff
 // Create CSR with unhealthy cluster protection effectively disabled, to guarantee we reach the tested logic.
 csrConfig := clusterstate.ClusterStateRegistryConfig{OkTotalUnreadyCount: nodeGroupCount * unreadyNodesCount}
-csr := clusterstate.NewClusterStateRegistry(provider, csrConfig, ctx.LogRecorder, NewBackoff(), nodegroupconfig.NewDefaultNodeGroupConfigProcessor(config.NodeGroupAutoscalingOptions{MaxNodeProvisionTime: 15 * time.Minute}), processors.AsyncNodeGroupStateChecker)
+csr := clusterstate.NewClusterStateRegistry(provider, csrConfig, ctx.LogRecorder, NewBackoff(), nodegroupconfig.NewDefaultNodeGroupConfigProcessor(config.NodeGroupAutoscalingOptions{MaxNodeProvisionTime: 15 * time.Minute, MaxNodeStartupTime: 15 * time.Minute}), processors.AsyncNodeGroupStateChecker)
```
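For context on why the test options gain MaxNodeStartupTime: with this change the registry is expected to resolve the startup timeout per node group through the config processor instead of the old package-level constant. A rough sketch of that lookup, with field and variable names assumed rather than taken from the diff:

```go
// Hypothetical sketch: the cluster state registry asks the config
// processor for the per-node-group startup timeout, falling back to
// the historical 15-minute default if the lookup fails.
startupTimeout, err := csr.nodeGroupConfigProcessor.GetMaxNodeStartupTime(nodeGroup)
if err != nil {
	startupTimeout = 15 * time.Minute
}
// A registered node that has not become ready within the timeout is
// treated as not started.
notStarted := time.Since(registeredTime) > startupTimeout
```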
merge conflict pro-tip: when you rebase you'll notice that the variable returned from NewScaleTestAutoscalingContext
has been changed to autoscalingCtx
Thanks for the heads up, rebased and reran all UTs. Found a small bug in snapshot_test that is unrelated to my change, fixed that too:

```
autoscaler/simulator/dynamicresources/snapshot] simulator/dynamicresources/snapshot/snapshot_test.go:636:4: (*testing.common).Fatalf format %s has arg addedNodeSlice.Spec.NodeName of wrong type *string
```
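For readers unfamiliar with that vet check: `%s` is not satisfied by a `*string`, because `fmt` does not dereference pointer arguments for string verbs. A minimal standalone reproduction (unrelated to the autoscaler code itself):

```go
package main

import "fmt"

func main() {
	name := "node-1"
	ptr := &name
	fmt.Printf("%s\n", *ptr) // ok: dereferenced string, prints "node-1"
	fmt.Printf("%s\n", ptr)  // flagged by `go vet`: arg ptr of wrong type *string;
	// at runtime this prints a %!s(*string=0x...) marker, not the node name.
}
```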
Force-pushed from 189d25b to 5b573a6.
```diff
 addedSlices := []*resourceapi.ResourceSlice{addedNodeSlice.DeepCopy()}
 if err := s.AddNodeResourceSlices(*addedNodeSlice.Spec.NodeName, addedSlices); err != nil {
-	t.Fatalf("failed to add %s resource slices: %v", addedNodeSlice.Spec.NodeName, err)
+	t.Fatalf("failed to add %s resource slices: %v", *addedNodeSlice.Spec.NodeName, err)
```
what explains this change?
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR introduces a new configurable flag, --max-node-startup-time, to the Cluster Autoscaler, controlling a timeout that is currently hard-coded to 15 minutes in clusterstate.go. Like the existing --max-node-provision-time flag, it lets users configure the maximum time the autoscaler waits for a node to transition from registered to ready, providing more flexibility in environments where node readiness varies with image size, initialization scripts, or network conditions.
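As an illustration of the shape of the change, a new duration flag alongside the existing one might be registered like this; the variable name and help text are assumptions for the sketch, not the PR's exact code:

```go
// Hypothetical flag definition mirroring --max-node-provision-time; the
// 15m default preserves the value previously hard-coded in clusterstate.go.
var maxNodeStartupTime = flag.Duration("max-node-startup-time", 15*time.Minute,
	"Maximum time CA waits for a node to transition from registered to ready")
```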
Which issue(s) this PR fixes:
NA
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Yes
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: