[Data] DefaultAutoscalerV2 doesn't scale nodes from zero #59896
Conversation
Signed-off-by: Hsien-Cheng Huang <ryankert01@gmail.com>
Code Review
This pull request implements a crucial feature for DefaultAutoscalerV2 by enabling it to scale up from zero worker nodes. The approach of discovering node types from the cluster configuration is sound, and the fallback to using only alive nodes ensures robustness. The new tests are comprehensive, covering scaling from zero and edge cases like node types with max_count=0.
My review includes a couple of suggestions to improve the readability and maintainability of the test code by reducing duplication and using existing helper methods. Overall, this is a great addition.
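For context, here is a rough sketch of the approach being described; it is illustrative only and not the merged implementation. It reuses the call sites that appear later in this thread (ray._private.state.state.get_cluster_config, ray.nodes), and a local namedtuple stands in for the PR's _NodeResourceSpec helper:

from collections import namedtuple

import ray
from ray._private.state import state

# Stand-in for the PR's _NodeResourceSpec, just for this sketch.
NodeResourceSpec = namedtuple("NodeResourceSpec", ["cpu", "gpu", "mem"])


def sketch_node_resource_specs():
    # Prefer node types declared in the cluster config so scaling can start
    # from zero workers; fall back to the nodes that are currently alive.
    specs = {}
    cluster_config = state.get_cluster_config()
    if cluster_config is not None:
        for cfg in cluster_config.node_group_configs:
            # Skip groups that can never host workers.
            if not cfg.resources or cfg.max_count == 0:
                continue
            spec = NodeResourceSpec(
                cpu=cfg.resources.get("CPU", 0),
                gpu=cfg.resources.get("GPU", 0),
                mem=cfg.resources.get("memory", 0),
            )
            specs.setdefault(spec, 0)
    else:
        for node in ray.nodes():
            if not node["Alive"]:
                continue
            r = node["Resources"]
            spec = NodeResourceSpec(
                cpu=r.get("CPU", 0), gpu=r.get("GPU", 0), mem=r.get("memory", 0)
            )
            specs[spec] = specs.get(spec, 0) + 1
    return specs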
bveeramani left a comment
Overall LGTM
mem=node_group_config.resources.get("memory", 0),
)
nodes_resource_spec_count[node_resource_spec] = 0
except Exception as e:
When do we expect to get exceptions here? Is there a more specific exception type we should catch?
If we catch a bare Exception, we'll mask valid bugs.
Errors should never pass silently.
Unless explicitly silenced.
It also goes against Python convention: https://peps.python.org/pep-0020/
I think the error we may get here is RaySystemError. Given that, it's better to remove the try...except block and let the system raise the error directly, since this error means Ray is not initialized or the GCS cannot be connected.
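A minimal sketch of that simplification, assuming the lookup happens through ray._private.state.state.get_cluster_config() as in the surrounding diff:

from ray._private.state import state

# Sketch only: no bare `except Exception` around the call. If Ray isn't
# initialized or the GCS is unreachable, the RaySystemError propagates to the
# caller instead of being silently swallowed.
cluster_config = state.get_cluster_config()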
# Skip if no resources or max_count=0 (cannot scale)
if (
    not node_group_config.resources
    or node_group_config.max_count == 0
):
What happens if there are node groups where min_count=max_count? How would the autoscaler behave?
Node groups with min_count=max_count (fixed size) are not skipped because the autoscaler needs to know about all node types when making scaling decisions.
> Node groups with min_count=max_count (fixed size) are not skipped because the autoscaler needs to know about all node types when making scaling decisions.

What happens if the autoscaler doesn't know about the fixed-size node groups? I think in the current implementation, the Ray Data autoscaler will attempt to scale up fixed-size node groups even though that's not possible.
# Patch cluster config to return None
with patch("ray.nodes", return_value=node_table):
    assert autoscaler._get_node_resource_spec_and_count() == expected
with patch(
    "ray._private.state.state.get_cluster_config",
    return_value=None,
):
    assert autoscaler._get_node_resource_spec_and_count() == expected
Just a heads up -- I think this might conflict with #59933
| "ray._private.state.state.get_cluster_config", | ||
| return_value=cluster_config, | ||
| ): | ||
| result = autoscaler._get_node_resource_spec_and_count() |
To avoid testing against internal methods, I think we should make this a public (no dunder) utility function. It'd also simplify the test because then we don't have to fake the autoscaler
I will have to patch ray.nodes() and get_cluster_config() if I want to remove the autoscaler (which is a little complex). I wonder if there's some magic function already available.
I'm confused. Don't we already patch ray.nodes() and get_cluster_config() in the tests?
What happens if we do something like this?
def test_get_node_resource_spec_and_count_skips_max_count_zero(self):
"""Test that node types with max_count=0 are skipped."""
- autoscaler = DefaultClusterAutoscalerV2(
- resource_manager=MagicMock(),
- execution_id="test_execution_id",
- )
# Simulate a cluster with only head node (no worker nodes)
node_table = [
{
"Resources": self._head_node,
"Alive": True,
},
]
# Create a mock cluster config with one valid node type and one with max_count=0
cluster_config = autoscaler_pb2.ClusterConfig()
# Node type 1: 4 CPU, 0 GPU, 1000 memory, max_count=10
node_group_config1 = autoscaler_pb2.NodeGroupConfig()
node_group_config1.resources["CPU"] = 4
node_group_config1.resources["memory"] = 1000
node_group_config1.max_count = 10
cluster_config.node_group_configs.append(node_group_config1)
# Node type 2: 8 CPU, 2 GPU, 2000 memory, max_count=0 (should be skipped)
node_group_config2 = autoscaler_pb2.NodeGroupConfig()
node_group_config2.resources["CPU"] = 8
node_group_config2.resources["GPU"] = 2
node_group_config2.resources["memory"] = 2000
node_group_config2.max_count = 0 # This should be skipped
cluster_config.node_group_configs.append(node_group_config2)
# Only the first node type should be discovered
expected = {
_NodeResourceSpec.of(cpu=4, gpu=0, mem=1000): 0,
}
with patch("ray.nodes", return_value=node_table):
with patch(
"ray._private.state.state.get_cluster_config",
return_value=cluster_config,
):
- result = autoscaler.get_node_resource_spec_and_count()
+ result = get_node_resource_spec_and_count()
assert result == expected
I tried removing the autoscaler from another function first (test_try_scale_up_cluster); that's why I reached this wrong conclusion.
Will update it ASAP.
Oh, got it. Maybe let's wait for #59933 to land before doing that specific follow-up, since I think the changes are related
agree!
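For reference, a rough sketch of the public-utility refactor being discussed; the function name, placement, and signature here are hypothetical, not merged code:

import ray
from ray._private.state import state


def get_node_resource_spec_and_count():
    # Hypothetical module-level helper: tests could patch ray.nodes and
    # state.get_cluster_config and call this directly, without constructing
    # a DefaultClusterAutoscalerV2.
    cluster_config = state.get_cluster_config()
    alive_nodes = [n for n in ray.nodes() if n["Alive"]]
    ...  # same spec-counting logic as the current private method

With a helper like that, the test shown above would patch the same two call sites and call the function directly instead of faking the autoscaler.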
Signed-off-by: Hsien-Cheng Huang <ryankert01@gmail.com>
Force-pushed from 030ebfc to cafa9ae
Thanks for the review!
for node_group_config in cluster_config.node_group_configs:
    # Skip if no resources or max_count=0 (cannot scale)
    if not node_group_config.resources or node_group_config.max_count == 0:
        continue
Autoscaler attempts to scale fixed-size node groups
Medium Severity
The code only skips node groups with max_count == 0 but doesn't handle fixed-size node groups where min_count == max_count. When min_count equals max_count, the node group has a fixed size and cannot be scaled up. Including these groups causes the autoscaler to request scaling for node types that can't actually be scaled, resulting in wasteful and incorrect requests. The min_count field from NodeGroupConfig should be compared with max_count to identify and skip fixed-size groups.
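A possible follow-up along those lines, extending the skip condition shown above; sketch only, relying on the min_count field mentioned in the finding:

for node_group_config in cluster_config.node_group_configs:
    # Skip groups that cannot be scaled up: no resources, max_count=0,
    # or a fixed size where min_count == max_count.
    if (
        not node_group_config.resources
        or node_group_config.max_count == 0
        or node_group_config.min_count == node_group_config.max_count
    ):
        continue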
bveeramani left a comment
LGTM except for these two nits.
Going to merge since this is an incremental improvement. I think it's okay to address the comments in a follow-up.
Description
Addresses a critical issue in the DefaultAutoscalerV2, where nodes were not being properly scaled from zero. With this update, clusters managed by Ray will now automatically provision additional nodes when there is workload demand, even when starting from an idle (zero-node) state.
Related issues
Closes #59682
Additional information
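As a usage illustration (not taken from the PR itself): with this change, a Ray Data workload submitted while the cluster has only a head node should create the resource demand that drives worker provisioning. The dataset and sizes below are arbitrary:

import ray

ray.init()  # the cluster may have zero worker nodes at this point

ds = ray.data.range(100_000).map(lambda row: {"id": row["id"] * 2})
# Executing the pipeline creates resource demand, which the autoscaler can now
# translate into scale-up requests even when starting from zero workers.
ds.materialize()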