[Data] DefaultAutoscalerV2 doesn't scale nodes from zero #59896
Conversation
Signed-off-by: Hsien-Cheng Huang <ryankert01@gmail.com>
Code Review
This pull request implements a crucial feature for DefaultAutoscalerV2 by enabling it to scale up from zero worker nodes. The approach of discovering node types from the cluster configuration is sound, and the fallback to using only alive nodes ensures robustness. The new tests are comprehensive, covering scaling from zero and edge cases like node types with max_count=0.
My review includes a couple of suggestions to improve the readability and maintainability of the test code by reducing duplication and using existing helper methods. Overall, this is a great addition.
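For context, here is a rough sketch of the approach being described; it is illustrative only and not the merged implementation. It reuses the call sites that appear later in this thread (ray._private.state.state.get_cluster_config, ray.nodes), and a local namedtuple stands in for the PR's _NodeResourceSpec helper:

from collections import namedtuple

import ray
from ray._private.state import state

# Stand-in for the PR's _NodeResourceSpec, just for this sketch.
NodeResourceSpec = namedtuple("NodeResourceSpec", ["cpu", "gpu", "mem"])


def sketch_node_resource_specs():
    # Prefer node types declared in the cluster config so scaling can start
    # from zero workers; fall back to the nodes that are currently alive.
    specs = {}
    cluster_config = state.get_cluster_config()
    if cluster_config is not None:
        for cfg in cluster_config.node_group_configs:
            # Skip groups that can never host workers.
            if not cfg.resources or cfg.max_count == 0:
                continue
            spec = NodeResourceSpec(
                cpu=cfg.resources.get("CPU", 0),
                gpu=cfg.resources.get("GPU", 0),
                mem=cfg.resources.get("memory", 0),
            )
            specs.setdefault(spec, 0)
    else:
        for node in ray.nodes():
            if not node["Alive"]:
                continue
            r = node["Resources"]
            spec = NodeResourceSpec(
                cpu=r.get("CPU", 0), gpu=r.get("GPU", 0), mem=r.get("memory", 0)
            )
            specs[spec] = specs.get(spec, 0) + 1
    return specs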
bveeramani left a comment
Overall LGTM
mem=node_group_config.resources.get("memory", 0),
)
nodes_resource_spec_count[node_resource_spec] = 0
except Exception as e:
When do we expect to get exceptions here? Is there a more specific exception type we should catch?
If we catch a bare Exception, we'll mask valid bugs.
Errors should never pass silently.
Unless explicitly silenced.
It also goes against Python convention: https://peps.python.org/pep-0020/
I think the error we may get here is RaySystemError. Given that, it's better to remove the try...except block and let the system raise the error directly, since this error means Ray is not initialized or the GCS cannot be connected.
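A minimal sketch of that simplification, assuming the lookup happens through ray._private.state.state.get_cluster_config() as in the surrounding diff:

from ray._private.state import state

# Sketch only: no bare `except Exception` around the call. If Ray isn't
# initialized or the GCS is unreachable, the RaySystemError propagates to the
# caller instead of being silently swallowed.
cluster_config = state.get_cluster_config()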
# Skip if no resources or max_count=0 (cannot scale)
if (
    not node_group_config.resources
    or node_group_config.max_count == 0
):
What happens if there are node groups where min_count=max_count? How would the autoscaler behave?
Node groups with min_count=max_count (fixed size) are not skipped because the autoscaler needs to know about all node types when making scaling decisions.
> Node groups with min_count=max_count (fixed size) are not skipped because the autoscaler needs to know about all node types when making scaling decisions.

What happens if the autoscaler doesn't know about the fixed-size node groups? I think in the current implementation, the Ray Data autoscaler will attempt to scale up fixed-size node groups even though that's not possible.
# Patch cluster config to return None
with patch("ray.nodes", return_value=node_table):
    assert autoscaler._get_node_resource_spec_and_count() == expected
with patch(
    "ray._private.state.state.get_cluster_config",
    return_value=None,
):
    assert autoscaler._get_node_resource_spec_and_count() == expected
Just a heads up -- I think this might conflict with #59933
| "ray._private.state.state.get_cluster_config", | ||
| return_value=cluster_config, | ||
| ): | ||
| result = autoscaler._get_node_resource_spec_and_count() |
To avoid testing against internal methods, I think we should make this a public (no dunder) utility function. It'd also simplify the test because then we don't have to fake the autoscaler
I will have to patch ray.nodes() and get_cluster_config() if I want to remove the autoscaler (which is a little complex). I wonder if there's some magic function already available.
I'm confused. Don't we already patch ray.nodes() and get_cluster_config() in the tests?
What happens if we do something like this?
def test_get_node_resource_spec_and_count_skips_max_count_zero(self):
"""Test that node types with max_count=0 are skipped."""
- autoscaler = DefaultClusterAutoscalerV2(
- resource_manager=MagicMock(),
- execution_id="test_execution_id",
- )
# Simulate a cluster with only head node (no worker nodes)
node_table = [
{
"Resources": self._head_node,
"Alive": True,
},
]
# Create a mock cluster config with one valid node type and one with max_count=0
cluster_config = autoscaler_pb2.ClusterConfig()
# Node type 1: 4 CPU, 0 GPU, 1000 memory, max_count=10
node_group_config1 = autoscaler_pb2.NodeGroupConfig()
node_group_config1.resources["CPU"] = 4
node_group_config1.resources["memory"] = 1000
node_group_config1.max_count = 10
cluster_config.node_group_configs.append(node_group_config1)
# Node type 2: 8 CPU, 2 GPU, 2000 memory, max_count=0 (should be skipped)
node_group_config2 = autoscaler_pb2.NodeGroupConfig()
node_group_config2.resources["CPU"] = 8
node_group_config2.resources["GPU"] = 2
node_group_config2.resources["memory"] = 2000
node_group_config2.max_count = 0 # This should be skipped
cluster_config.node_group_configs.append(node_group_config2)
# Only the first node type should be discovered
expected = {
_NodeResourceSpec.of(cpu=4, gpu=0, mem=1000): 0,
}
with patch("ray.nodes", return_value=node_table):
with patch(
"ray._private.state.state.get_cluster_config",
return_value=cluster_config,
):
- result = autoscaler.get_node_resource_spec_and_count()
+ result = get_node_resource_spec_and_count()
assert result == expected
I tried removing the autoscaler from another function first (test_try_scale_up_cluster); that's why I reached this wrong conclusion.
Will update it ASAP.
Oh, got it. Maybe let's wait for #59933 to land before doing that specific follow-up, since I think the changes are related
agree!
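For reference, a rough sketch of the public-utility refactor being discussed; the function name, placement, and signature here are hypothetical, not merged code:

import ray
from ray._private.state import state


def get_node_resource_spec_and_count():
    # Hypothetical module-level helper: tests could patch ray.nodes and
    # state.get_cluster_config and call this directly, without constructing
    # a DefaultClusterAutoscalerV2.
    cluster_config = state.get_cluster_config()
    alive_nodes = [n for n in ray.nodes() if n["Alive"]]
    ...  # same spec-counting logic as the current private method

With a helper like that, the test shown above would patch the same two call sites and call the function directly instead of faking the autoscaler.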
Signed-off-by: Hsien-Cheng Huang <ryankert01@gmail.com>
Force-pushed from 030ebfc to cafa9ae
Thanks for the review!
for node_group_config in cluster_config.node_group_configs:
    # Skip if no resources or max_count=0 (cannot scale)
    if not node_group_config.resources or node_group_config.max_count == 0:
        continue
Autoscaler attempts to scale fixed-size node groups
Medium Severity
The code only skips node groups with max_count == 0 but doesn't handle fixed-size node groups where min_count == max_count. When min_count equals max_count, the node group has a fixed size and cannot be scaled up. Including these groups causes the autoscaler to request scaling for node types that can't actually be scaled, resulting in wasteful and incorrect requests. The min_count field from NodeGroupConfig should be compared with max_count to identify and skip fixed-size groups.
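A possible follow-up along those lines, extending the skip condition shown above; sketch only, relying on the min_count field mentioned in the finding:

for node_group_config in cluster_config.node_group_configs:
    # Skip groups that cannot be scaled up: no resources, max_count=0,
    # or a fixed size where min_count == max_count.
    if (
        not node_group_config.resources
        or node_group_config.max_count == 0
        or node_group_config.min_count == node_group_config.max_count
    ):
        continue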
bveeramani left a comment
LGTM except for these two nits.
Going to merge since this is an incremental improvement. I think it's okay to address the comments in a follow-up.
Description
Addresses a critical issue in the DefaultAutoscalerV2, where nodes were not being properly scaled from zero. With this update, clusters managed by Ray will now automatically provision additional nodes when there is workload demand, even when starting from an idle (zero-node) state.
Related issues
Closes #59682
Additional information
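As a usage illustration (not taken from the PR itself): with this change, a Ray Data workload submitted while the cluster has only a head node should create the resource demand that drives worker provisioning. The dataset and sizes below are arbitrary:

import ray

ray.init()  # the cluster may have zero worker nodes at this point

ds = ray.data.range(100_000).map(lambda row: {"id": row["id"] * 2})
# Executing the pipeline creates resource demand, which the autoscaler can now
# translate into scale-up requests even when starting from zero workers.
ds.materialize()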