
Conversation

@ryankert01 (Member) commented Jan 6, 2026

Description

Addresses a critical issue in the DefaultAutoscalerV2, where nodes were not being properly scaled from zero. With this update, clusters managed by Ray will now automatically provision additional nodes when there is workload demand, even when starting from an idle (zero-node) state.
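
For illustration only, the scenario this addresses looks roughly like the sketch below (assumed setup: an autoscaling cluster whose worker groups start at zero nodes; the workload is just an example, not code from this PR):

# Illustrative sketch, not code from this PR: a Ray Data workload on a cluster
# whose worker groups are configured to start at zero nodes. Before this change,
# DefaultAutoscalerV2 could not discover node types from an idle cluster and
# never requested a scale-up; with it, node types come from the cluster config.
import ray

ray.init()  # connect to an autoscaling cluster that currently has no workers

ds = ray.data.range(100_000_000)
ds = ds.map_batches(lambda batch: batch)  # creates resource demand
print(ds.count())  # demand drives the autoscaler to provision worker nodes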

Related issues

Closes #59682


@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request implements a crucial feature for DefaultAutoscalerV2 by enabling it to scale up from zero worker nodes. The approach of discovering node types from the cluster configuration is sound, and the fallback to using only alive nodes ensures robustness. The new tests are comprehensive, covering scaling from zero and edge cases like node types with max_count=0.

My review includes a couple of suggestions to improve the readability and maintainability of the test code by reducing duplication and using existing helper methods. Overall, this is a great addition.
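
As a rough sketch of the approach described above (the helper name is hypothetical, and the dependencies are passed in as parameters rather than taken from Ray's private state API), the config-based discovery with an alive-node fallback could look like this:

# Sketch only, not the merged implementation. Node types are discovered from
# the cluster config when available; alive nodes are then counted per resource
# shape, which also serves as the fallback when no config is returned.
def discover_node_resource_specs(get_cluster_config, list_nodes):
    specs = {}  # (cpu, gpu, memory) -> count of alive nodes of that shape

    cluster_config = get_cluster_config()
    if cluster_config is not None:
        for group in cluster_config.node_group_configs:
            # Groups with no resources or max_count == 0 can never be scaled up.
            if not group.resources or group.max_count == 0:
                continue
            key = (
                group.resources.get("CPU", 0),
                group.resources.get("GPU", 0),
                group.resources.get("memory", 0),
            )
            specs[key] = 0  # known node type, possibly zero nodes running

    for node in list_nodes():
        if not node.get("Alive"):
            continue
        resources = node.get("Resources", {})
        key = (
            resources.get("CPU", 0),
            resources.get("GPU", 0),
            resources.get("memory", 0),
        )
        specs[key] = specs.get(key, 0) + 1

    return specs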

@ryankert01 ryankert01 changed the title [WIP] [Data] DefaultAutoscalerV2 doesn't scale nodes from zero [Data] DefaultAutoscalerV2 doesn't scale nodes from zero Jan 8, 2026
@bveeramani (Member) left a comment

Overall LGTM

mem=node_group_config.resources.get("memory", 0),
)
nodes_resource_spec_count[node_resource_spec] = 0
except Exception as e:
Member

When do we expect to get exceptions here? Is there a more specific exception type we should catch?

If we do a bare exception, we'll mask valid bugs.

Errors should never pass silently.
Unless explicitly silenced.

Also goes against Python convention: https://peps.python.org/pep-0020/

Contributor

I think the error we may get here is RaySystemError. In that case I think it's better to remove the try...except block and let the system raise the error directly, since this error means Ray is not initialized or the GCS cannot be connected.

https://github.com/ray-project/ray/blob/5122b9de6d45a9a034c4e03276420fb1ad9df736/python/ray/_private/state.py#L38-L60C38
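
To make the two options concrete, here is a minimal sketch (the helper names are hypothetical, and the choice of RaySystemError follows this thread rather than the merged code):

from ray.exceptions import RaySystemError


def get_cluster_config_or_raise(get_cluster_config):
    # Option A (suggested above): no try/except, so an uninitialized Ray or an
    # unreachable GCS raises immediately instead of being masked.
    return get_cluster_config()


def get_cluster_config_or_none(get_cluster_config):
    # Option B: if a guard is kept, catch only the specific error and let
    # anything unexpected propagate, so real bugs are not silenced.
    try:
        return get_cluster_config()
    except RaySystemError:
        return None  # caller falls back to alive-node discovery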

Comment on lines 133 to 137
# Skip if no resources or max_count=0 (cannot scale)
if (
    not node_group_config.resources
    or node_group_config.max_count == 0
):
Member

What happens if there are node groups where min_count=max_count? How would the autoscaler behave?

Member Author

Node groups with min_count=max_count (fixed size) are not skipped because the autoscaler needs to know about all node types when making scaling decisions.

Member

> Node groups with min_count=max_count (fixed size) are not skipped because the autoscaler needs to know about all node types when making scaling decisions.

What happens if the autoscaler doesn't know about the fixed-size node groups? I think in the current implementation, the Ray Data autoscaler will attempt to scale up fixed-size node groups even though that's not possible.

Comment on lines 127 to 133
# Patch cluster config to return None
with patch("ray.nodes", return_value=node_table):
    assert autoscaler._get_node_resource_spec_and_count() == expected
with patch(
    "ray._private.state.state.get_cluster_config",
    return_value=None,
):
    assert autoscaler._get_node_resource_spec_and_count() == expected
Member

Just a heads up -- I think this might conflict with #59933

"ray._private.state.state.get_cluster_config",
return_value=cluster_config,
):
result = autoscaler._get_node_resource_spec_and_count()
Member

To avoid testing against internal methods, I think we should make this a public (no dunder) utility function. It'd also simplify the test because then we don't have to fake the autoscaler

Member Author

I will have to patch ray.nodes() and get_cluster_config() if I want to remove the autoscaler (which is a little complex). I wonder if there's some magic function already available.

Member

I'm confused. Don't we already patch ray.nodes() and get_cluster_config() in the tests?

What happens if we do something like this?

    def test_get_node_resource_spec_and_count_skips_max_count_zero(self):
        """Test that node types with max_count=0 are skipped."""
-        autoscaler = DefaultClusterAutoscalerV2(
-            resource_manager=MagicMock(),
-            execution_id="test_execution_id",
-        )

        # Simulate a cluster with only head node (no worker nodes)
        node_table = [
            {
                "Resources": self._head_node,
                "Alive": True,
            },
        ]

        # Create a mock cluster config with one valid node type and one with max_count=0
        cluster_config = autoscaler_pb2.ClusterConfig()

        # Node type 1: 4 CPU, 0 GPU, 1000 memory, max_count=10
        node_group_config1 = autoscaler_pb2.NodeGroupConfig()
        node_group_config1.resources["CPU"] = 4
        node_group_config1.resources["memory"] = 1000
        node_group_config1.max_count = 10
        cluster_config.node_group_configs.append(node_group_config1)

        # Node type 2: 8 CPU, 2 GPU, 2000 memory, max_count=0 (should be skipped)
        node_group_config2 = autoscaler_pb2.NodeGroupConfig()
        node_group_config2.resources["CPU"] = 8
        node_group_config2.resources["GPU"] = 2
        node_group_config2.resources["memory"] = 2000
        node_group_config2.max_count = 0  # This should be skipped
        cluster_config.node_group_configs.append(node_group_config2)

        # Only the first node type should be discovered
        expected = {
            _NodeResourceSpec.of(cpu=4, gpu=0, mem=1000): 0,
        }

        with patch("ray.nodes", return_value=node_table):
            with patch(
                "ray._private.state.state.get_cluster_config",
                return_value=cluster_config,
            ):
-               result = autoscaler.get_node_resource_spec_and_count()
+               result = get_node_resource_spec_and_count()
                assert result == expected

Member Author

I tried removing the autoscaler from another function first (test_try_scale_up_cluster). That's why I reached this wrong conclusion.
Will update it ASAP.

Member

Oh, got it. Maybe let's wait for #59933 to land before doing that specific follow-up, since I think the changes are related

Member Author

agree!

@ryankert01 ryankert01 marked this pull request as ready for review January 12, 2026 11:21
@ryankert01 ryankert01 requested a review from a team as a code owner January 12, 2026 11:21
@ray-gardener ray-gardener bot added core Issues that should be addressed in Ray Core data Ray Data-related issues community-contribution Contributed by the community labels Jan 12, 2026
Signed-off-by: Hsien-Cheng Huang <ryankert01@gmail.com>
@ryankert01 ryankert01 force-pushed the scale-nodes-from-zero-for-default-auto-scaler-v2 branch from 030ebfc to cafa9ae Compare January 12, 2026 16:09
@ryankert01 (Member Author)

Thanks for the review!
PTAL @bveeramani @machichima

for node_group_config in cluster_config.node_group_configs:
    # Skip if no resources or max_count=0 (cannot scale)
    if not node_group_config.resources or node_group_config.max_count == 0:
        continue

Autoscaler attempts to scale fixed-size node groups

Medium Severity

The code only skips node groups with max_count == 0 but doesn't handle fixed-size node groups where min_count == max_count. When min_count equals max_count, the node group has a fixed size and cannot be scaled up. Including these groups causes the autoscaler to request scaling for node types that can't actually be scaled, resulting in wasteful and incorrect requests. The min_count field from NodeGroupConfig should be compared with max_count to identify and skip fixed-size groups.

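A sketch of the extended check being suggested here (hypothetical helper; treating min_count == max_count as unscalable is the suggestion above, not merged behavior):

# Sketch only: skip node groups that can never grow, including fixed-size
# groups where min_count == max_count, per the suggestion above.
def can_scale_up(node_group_config) -> bool:
    if not node_group_config.resources:
        return False
    if node_group_config.max_count == 0:
        return False
    if node_group_config.min_count == node_group_config.max_count:
        # Fixed-size group: requesting more nodes would be a no-op.
        return False
    return True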

@bveeramani (Member) left a comment

LGTM except for these two nits.

Going to merge since this is an incremental improvement. I think it's okay to address the comments in a follow-up.

@bveeramani bveeramani enabled auto-merge (squash) January 12, 2026 19:14
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Jan 12, 2026
@bveeramani bveeramani merged commit 0dfde3b into ray-project:master Jan 12, 2026
7 of 8 checks passed
AYou0207 pushed a commit to AYou0207/ray that referenced this pull request Jan 13, 2026
rushikeshadhav pushed a commit to rushikeshadhav/ray that referenced this pull request Jan 14, 2026
