Make maxNodeStartupTime configurable #8543
Conversation
Welcome @lxuan94-pp!
Hi @lxuan94-pp. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: lxuan94-pp. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/ok-to-test
this seems like a reasonable flag to propose, i'd like a little time to study the clusterapi specific changes.
/test pull-cluster-autoscaler-e2e-azure-master
Thanks @elmiko and @jackfrancis for the reviews, please let me know if you have further concerns.
```go
func (p *DelegatingNodeGroupConfigProcessor) GetMaxNodeStartupTime(nodeGroup cloudprovider.NodeGroup) (time.Duration, error) {
	ngConfig, err := nodeGroup.GetOptions(p.nodeGroupDefaults)
	if err != nil && err != cloudprovider.ErrNotImplemented {
		return 15 * time.Minute, err
```
I think we want to return p.nodeGroupDefaults.MaxNodeStartupTime
here as well
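For reference, the method after the suggested fix might look like this. This is a sketch: the nil-config and ErrNotImplemented branches are assumed from the pattern of the sibling GetMaxNodeProvisionTime delegating method, not copied from the diff.

```go
func (p *DelegatingNodeGroupConfigProcessor) GetMaxNodeStartupTime(nodeGroup cloudprovider.NodeGroup) (time.Duration, error) {
	ngConfig, err := nodeGroup.GetOptions(p.nodeGroupDefaults)
	if err != nil && err != cloudprovider.ErrNotImplemented {
		// On a real error, fall back to the configured default
		// rather than a hard-coded 15 minutes.
		return p.nodeGroupDefaults.MaxNodeStartupTime, err
	}
	if ngConfig == nil || err == cloudprovider.ErrNotImplemented {
		// Provider has no per-node-group options: use the global default.
		return p.nodeGroupDefaults.MaxNodeStartupTime, nil
	}
	return ngConfig.MaxNodeStartupTime, nil
}
```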
Thanks, updated.
```diff
 // Create CSR with unhealthy cluster protection effectively disabled, to guarantee we reach the tested logic.
 csrConfig := clusterstate.ClusterStateRegistryConfig{OkTotalUnreadyCount: nodeGroupCount * unreadyNodesCount}
-csr := clusterstate.NewClusterStateRegistry(provider, csrConfig, ctx.LogRecorder, NewBackoff(), nodegroupconfig.NewDefaultNodeGroupConfigProcessor(config.NodeGroupAutoscalingOptions{MaxNodeProvisionTime: 15 * time.Minute}), processors.AsyncNodeGroupStateChecker)
+csr := clusterstate.NewClusterStateRegistry(provider, csrConfig, ctx.LogRecorder, NewBackoff(), nodegroupconfig.NewDefaultNodeGroupConfigProcessor(config.NodeGroupAutoscalingOptions{MaxNodeProvisionTime: 15 * time.Minute, MaxNodeStartupTime: 15 * time.Minute}), processors.AsyncNodeGroupStateChecker)
```
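For context on why the test options gain MaxNodeStartupTime: with this change the registry is expected to resolve the startup timeout per node group through the config processor instead of the old package-level constant. A rough sketch of that lookup, with field and variable names assumed rather than taken from the diff:

```go
// Hypothetical sketch: the cluster state registry asks the config
// processor for the per-node-group startup timeout, falling back to
// the historical 15-minute default if the lookup fails.
startupTimeout, err := csr.nodeGroupConfigProcessor.GetMaxNodeStartupTime(nodeGroup)
if err != nil {
	startupTimeout = 15 * time.Minute
}
// A registered node that has not become ready within the timeout is
// treated as not started.
notStarted := time.Since(registeredTime) > startupTimeout
```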
merge conflict pro-tip: when you rebase you'll notice that the variable returned from NewScaleTestAutoscalingContext
has been changed to autoscalingCtx
Thanks for the heads up, rebased and reran all UTs. Found a small bug in snapshot_test that is unrelated to my change, fixed that too:

```
autoscaler/simulator/dynamicresources/snapshot] simulator/dynamicresources/snapshot/snapshot_test.go:636:4: (*testing.common).Fatalf format %s has arg addedNodeSlice.Spec.NodeName of wrong type *string
```
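For readers unfamiliar with that vet check: `%s` is not satisfied by a `*string`, because `fmt` does not dereference pointer arguments for string verbs. A minimal standalone reproduction (unrelated to the autoscaler code itself):

```go
package main

import "fmt"

func main() {
	name := "node-1"
	ptr := &name
	fmt.Printf("%s\n", *ptr) // ok: dereferenced string, prints "node-1"
	fmt.Printf("%s\n", ptr)  // flagged by `go vet`: arg ptr of wrong type *string;
	// at runtime this prints a %!s(*string=0x...) marker, not the node name.
}
```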
Force-pushed from 189d25b to 5b573a6.
```diff
 addedSlices := []*resourceapi.ResourceSlice{addedNodeSlice.DeepCopy()}
 if err := s.AddNodeResourceSlices(*addedNodeSlice.Spec.NodeName, addedSlices); err != nil {
-	t.Fatalf("failed to add %s resource slices: %v", addedNodeSlice.Spec.NodeName, err)
+	t.Fatalf("failed to add %s resource slices: %v", *addedNodeSlice.Spec.NodeName, err)
```
what explains this change?
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR introduces a new configurable flag, --max-node-startup-time, to the Cluster Autoscaler, controlling a timeout that is currently hard-coded to 15 minutes in clusterstate.go. Like the existing --max-node-provision-time flag, it lets users configure the maximum time the autoscaler waits for a node to transition from registered to ready, providing more flexibility in environments where node readiness varies with image size, initialization scripts, or network conditions.
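As an illustration of the shape of the change, a new duration flag alongside the existing one might be registered like this; the variable name and help text are assumptions for the sketch, not the PR's exact code:

```go
// Hypothetical flag definition mirroring --max-node-provision-time; the
// 15m default preserves the value previously hard-coded in clusterstate.go.
var maxNodeStartupTime = flag.Duration("max-node-startup-time", 15*time.Minute,
	"Maximum time CA waits for a node to transition from registered to ready")
```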
Which issue(s) this PR fixes:
NA
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Yes
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: