[feat][kubectl-plugin] add scale command #2926

Open

wants to merge 1 commit into master

Conversation

davidxia
Contributor

@davidxia davidxia commented Feb 5, 2025

Adds a command to scale a RayCluster's worker group.

closes #110

Example Usage

$ kubectl ray scale cluster -h
Scale a Ray cluster's worker group.

Usage:
  ray scale cluster (WORKERGROUP) (-c/--ray-cluster RAYCLUSTER) (-r/--replicas N) [flags]

Examples:
  # Scale a Ray cluster's worker group to 3 replicas
  kubectl ray scale cluster my-workergroup --ray-cluster my-raycluster --replicas 3

$ kubectl ray scale default-group --ray-cluster NONEXISTENT --replicas 0
Error: failed to scale worker group default-group in Ray cluster NONEXISTENT in namespace default: rayclusters.ray.io "NONEXISTENT" not found

$ kubectl ray scale DEADBEEF --ray-cluster dxia-test --replicas 1
Error: worker group DEADBEEF not found in Ray cluster dxia-test in namespace default. Available worker groups: default-group, another-group, yet-another-group

$ kubectl ray scale default-group --ray-cluster dxia-test --replicas 3
Scaled worker group default-group in Ray cluster dxia-test in namespace default from 0 to 3 replicas

$ kubectl ray scale default-group --ray-cluster dxia-test --replicas 1
Scaled worker group default-group in Ray cluster dxia-test in namespace default from 3 to 1 replicas

$ kubectl ray scale default-group --ray-cluster dxia-test --replicas -1
Error: must specify -r/--replicas with a non-negative integer
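
For readers interested in what is behind these messages, here is a minimal sketch of the core scaling flow. This is an illustration only, not the PR's implementation; the import paths, package name, and helper name are assumptions based on the KubeRay operator module.

```go
// Sketch only: scale one worker group of a RayCluster to the desired replica count.
// Package, helper name, and import paths are assumptions, not the PR's actual code.
package scale

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	rayclient "github.com/ray-project/kuberay/ray-operator/pkg/client/clientset/versioned"
)

// scaleWorkerGroup sets the replica count of one worker group in a RayCluster.
func scaleWorkerGroup(ctx context.Context, client rayclient.Interface, namespace, cluster, group string, replicas int32) error {
	if replicas < 0 {
		return fmt.Errorf("must specify -r/--replicas with a non-negative integer")
	}

	rc, err := client.RayV1().RayClusters(namespace).Get(ctx, cluster, metav1.GetOptions{})
	if err != nil {
		return fmt.Errorf("failed to scale worker group %s in Ray cluster %s in namespace %s: %w", group, cluster, namespace, err)
	}

	for i := range rc.Spec.WorkerGroupSpecs {
		wg := &rc.Spec.WorkerGroupSpecs[i]
		if wg.GroupName != group {
			continue
		}
		previous := int32(0)
		if wg.Replicas != nil {
			previous = *wg.Replicas
		}
		wg.Replicas = &replicas
		if _, err := client.RayV1().RayClusters(namespace).Update(ctx, rc, metav1.UpdateOptions{}); err != nil {
			return err
		}
		fmt.Printf("Scaled worker group %s in Ray cluster %s in namespace %s from %d to %d replicas\n",
			group, cluster, namespace, previous, replicas)
		return nil
	}
	return fmt.Errorf("worker group %s not found in Ray cluster %s in namespace %s", group, cluster, namespace)
}
```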

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@MortalHappiness
Member

@davidxia let me know when this is ready for review.

@davidxia davidxia force-pushed the scale branch 2 times, most recently from e779b60 to aa8e35f Compare February 6, 2025 01:22
@davidxia davidxia marked this pull request as ready for review February 6, 2025 02:46
@davidxia
Contributor Author

davidxia commented Feb 6, 2025

ready for review 🙏

```go
cmdFactory := cmdutil.NewFactory(options.configFlags)

cmd := &cobra.Command{
	Use: "scale [WORKERGROUP] [-c/--raycluster CLUSTERNAME] [-r/--replicas N]",
```
Member

I think `kubectl ray scale cluster (CLUSTERNAME) (WORKERGROUP) [flags]` would be better.

That is, create a `cluster` sub-command under `scale`, and make both the cluster name and worker group name required arguments.

Also, note that required arguments should be wrapped with () and optional parameters with [] to be consistent with kubectl. You can check kubectl get --help or kubectl ray session --help for details.

cc @kevin85421 @andrewsykim What do you think?
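
As a minimal, hypothetical illustration of that convention (not code from this PR; the function name is made up):

```go
// Hypothetical illustration of kubectl's help convention:
// required arguments wrapped in (), optional parameters in [].
package scale

import "github.com/spf13/cobra"

func newScaleClusterCommand() *cobra.Command {
	return &cobra.Command{
		Use:   "cluster (CLUSTERNAME) (WORKERGROUP) [flags]",
		Short: "Scale a Ray cluster's worker group.",
	}
}
```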

Contributor Author

I like those semantics more. Are there other resources besides RayCluster that can be scaled?

Contributor Author

I updated it to be a sub-command, `kubectl ray scale cluster`, but I kept the `-c/--ray-cluster` flag because:

Prefer flags to args. It’s a bit more typing, but it makes it much clearer what is going on. It also makes it easier to make changes to how you accept input in the future. Sometimes when using args, it’s impossible to add new input without breaking existing behavior or creating ambiguity.

If you’ve got two or more arguments for different things, you’re probably doing something wrong.

— https://clig.dev/

Lmk if this works.
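
Concretely, the flag-based interface described above might be wired up roughly like this. This is a sketch only; the `options` fields and `options.Run` helper are assumptions, not the PR's exact code.

```go
// Sketch of the sub-command with the flags discussed above (names assumed).
cmd := &cobra.Command{
	Use:   "cluster (WORKERGROUP) (-c/--ray-cluster RAYCLUSTER) (-r/--replicas N)",
	Short: "Scale a Ray cluster's worker group.",
	Args:  cobra.ExactArgs(1), // the worker group name is the only positional argument
	RunE: func(cmd *cobra.Command, args []string) error {
		// options.Run is hypothetical; it would perform the scaling shown earlier.
		return options.Run(cmd.Context(), args[0])
	},
}
cmd.Flags().StringVarP(&options.rayCluster, "ray-cluster", "c", "", "name of the Ray cluster")
cmd.Flags().Int32VarP(&options.replicas, "replicas", "r", -1, "desired number of replicas for the worker group")
```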

Member

Then how about also making the worker group a flag, like `--worker-group`?

Contributor Author

@davidxia davidxia Feb 17, 2025

Hm, I kind of like how the current command follows the VERB NOUN NAME semantics of the other commands like kubectl ray create cluster NAME or kubectl ray get workergroups NAME. I think the help message makes it apparent the first positional arg has to be the worker group name.

But not a strong opinion. Happy to change especially if others like that more too.

davidxia added a commit to davidxia/kuberay that referenced this pull request Feb 10, 2025
We plan to add a [command to scale a worker group][1] in an existing
RayCluster, e.g. `kubectl ray scale cluster (CLUSTER_NAME) (WORKER_GROUP)`. This
requires the user to know the group name. There's currently no way to get the
worker group names in a RayCluster other than getting the resource with kubectl
and looking for or parsing out the names. A command to get worker group
details for a cluster might be helpful.

[1]: ray-project#2926

Signed-off-by: David Xia <david@davidxia.com>
andrewsykim pushed a commit that referenced this pull request Feb 13, 2025
We plan to add a [command to scale a worker group][1] in an existing
RayCluster, e.g. `kubectl ray scale cluster (CLUSTER_NAME) (WORKER_GROUP)`. This
requires the user to know the group name. There's currently no way to get the
worker group names in a RayCluster other than getting the resource with kubectl
and looking for or parsing out the names. A command to get worker group
details for a cluster might be helpful.

## Example Usage

```console
$ kubectl ray get workergroups -A
NAMESPACE                NAME            REPLICAS   CPUS      GPUS   TPUS   MEMORY         CLUSTER
default                  default-group   1/1        2         0      0      4Gi            dxia-test
explorer-one             gpuWorker       1/1        36        4      0      256Gi          yzhao
foundation-models        cpuWorker       1/1        20        0      0      200Gi          jacquelinew-v3
foundation-models        gpuWorker       1/1        200       8      0      1000Gi         jacquelinew-v3
foundation-models        redis           1/1        13        0      0      112Gi          jacquelinew-v3

$ kubectl ray get workergroups -n default
NAME            REPLICAS   CPUS   GPUS   TPUS   MEMORY   CLUSTER
default-group   1/1        2      0      0      4Gi      dxia-test

$ kubectl ray get workergroups -n foundation-models -c jacquelinew-v3
NAME        REPLICAS   CPUS   GPUS   TPUS   MEMORY   CLUSTER
cpuWorker   1/1        20     0      0      200Gi    jacquelinew-v3
gpuWorker   1/1        200    8      0      1000Gi   jacquelinew-v3
redis       1/1        13     0      0      112Gi    jacquelinew-v3

$ kubectl ray get workergroups gpuWorker -A
NAMESPACE           NAME        REPLICAS   CPUS   GPUS   TPUS   MEMORY   CLUSTER
explorer-one        gpuWorker   1/1        36     4      0      256Gi    yzhao
foundation-models   gpuWorker   1/1        200    8      0      1000Gi   jacquelinew-v3
```

[1]: #2926

Signed-off-by: David Xia <david@davidxia.com>
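
For reference, pulling the same worker-group details out of a RayCluster object programmatically might look roughly like the sketch below. The types come from the KubeRay operator module; the package and helper names are made up.

```go
// Sketch: summarize the worker groups of a RayCluster, the information the
// `kubectl ray get workergroups` command surfaces in the table above.
package workergroups

import (
	"fmt"

	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

func workerGroupSummaries(rc *rayv1.RayCluster) []string {
	summaries := make([]string, 0, len(rc.Spec.WorkerGroupSpecs))
	for _, wg := range rc.Spec.WorkerGroupSpecs {
		replicas := int32(0)
		if wg.Replicas != nil {
			replicas = *wg.Replicas
		}
		summaries = append(summaries, fmt.Sprintf("%s: %d replicas", wg.GroupName, replicas))
	}
	return summaries
}
```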
@davidxia davidxia marked this pull request as draft February 13, 2025 22:13
@davidxia davidxia force-pushed the scale branch 2 times, most recently from f966a7c to 413dba6 Compare February 14, 2025 14:43
@davidxia davidxia marked this pull request as ready for review February 14, 2025 14:44
@davidxia
Contributor Author

@MortalHappiness ready for another review. Do you think we need to address #110 (comment)?

I noticed that if I scale down a group from, for example, 3 to 1, sometimes all worker Pods are terminated and a new one is created. Is this expected controller behavior?

@MortalHappiness
Member

Do you think we need to address #110 (comment)?

cc @kevin85421 @andrewsykim WDYT?

I noticed that if I scale down a group from, for example, 3 to 1, sometimes all worker Pods are terminated and a new one is created. Is this expected controller behavior?

I don't think this is the expected behavior.

@davidxia
Contributor Author

I don't think this is the expected behavior.

The behavior is unrelated to this PR since it comes from the controller, but I'm curious whether others can repro. I can reproduce it just with `kubectl edit raycluster`, scaling a worker group from 3 replicas down to 1.

Development

Successfully merging this pull request may close these issues.

[Feature] Add kubectl plugin to help scale ray raycluster