Skip to content

Conversation

chiayi
Copy link
Contributor

@chiayi chiayi commented Aug 28, 2025

Why are these changes needed?

Part of #3902. POC, Adds group indexing and host index to multi-host workers.

These labels are useful for running workloads with workers when numOfHosts > 1 and/or replicas > 1 for a worker group, where it's important that the workload runs on a specific worker or numOfHosts workers scaled as a part of the same replica. This is the case for TPU or GPU workloads with topology or worker index requirements.

Additionally, this PR adds logic to atomically delete a worker group replica where numOfHosts > 1. This logic is necessary because most multi-host workloads will hang and fail when a single worker in the group fails, and we should delete or restart these Pods together.

Related issue number

For: #3902

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@chiayi chiayi force-pushed the multihost-indexing branch 2 times, most recently from 8a74046 to a6b94b3 Compare August 28, 2025 17:32
@chiayi
Copy link
Contributor Author

chiayi commented Aug 28, 2025

@ryanaoleary PTAL when you get the chance.

@ryanaoleary
Copy link
Collaborator

ryanaoleary commented Aug 29, 2025

It'd be good to make clear the value of this PR.

Currently host and replica indexing for multi-host workers occurs in a separate GKE webhook that injects these values as env vars and a k8s label. The env vars and replicaIndex label are then read from within Ray to set Raylet labels and do things like atomically scale multi-host slices with autoscaling.

This PR moves the logic for indexing KubeRay worker Pods that request TPU from the webhook to KubeRay itself. By assigning indices as k8s Pod labels directly from KubeRay when they are created, we avoid the necessity for complicated logic in the TPU webhook that tracks the state of multi-host replicas in a RayCluster using a PodInformer. Since these variables are already used in Ray core and libraries like Train to handle the multi-host case, it makes sense to consolidate the logic in KubeRay. Additionally, since KubeRay is aware of when Pods are deleted, it becomes easier to scale-down multi-host replicas atomically. Overall, this PR is consolidating logic that is currently spread across the TPU webhook, KubeRay, and Ray core.

The next step after this PR would be to move the environment variable injection that occurs in the TPU webhook to Ray core when the Raylet is started on a node. The worker lifecycle would then look as follows for multi-host workers:

  1. Ray sends a request to scale N workers to satisfy resource requests for TPU
  2. KubeRay scales up a multi-host (NumOfHosts > 1) replica with N Pods (1 Pod per host) and indexes each worker using k8s labels
  3. When the Raylet is started on each worker Pod, the information in the k8s Pod spec (ray.io/host-index and ray.io/replica-index labels) is used to set required Jax environment variables like TPU_WORKER_ID, TPU_WORKER_HOSTNAMES, and TPU_NAME and the corresponding Raylet node labels for label based scheduling.
  4. When a multi-host worker is deleted by KubeRay, we can check the ray.io/replica-index label to scale down the entire slice atomically.

@chiayi chiayi force-pushed the multihost-indexing branch from a6b94b3 to 6935b9e Compare August 29, 2025 17:56
}

// Check if RayTpuMulithostIndexing feature is enabled
// Currently multihostIndexing won't work with Autoscaler v2 since autoscaler delete doesn't follow replica groups
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// Currently multihostIndexing won't work with Autoscaler v2 since autoscaler delete doesn't follow replica groups

Can you explain more why this wouldn't work with the v2 autoscaler? Since it currently scales by replicas my initial thinking was that there shouldn't be an incompatibility.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, if that's the case then it is an misunderstanding on my side based on our prior discussion where my interpretation was that there was incompatibility due to how it scaled the replicas. Will remove the autoscaling v2 check.

Copy link
Collaborator

@ryanaoleary ryanaoleary Oct 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could just be forgetting what the incompatibility is, if I'm remembering correctly the v2 autoscaler determines the number of replicas of a group to scale, and then submits a scale request by patching both the replica count and workersToDelete of that group here.

There could be an issue with how the v2 autoscaler scales down here, since it doesn't consider whether to_delete_id is part of a multi-host group and will cause the entire group to scale down, but I think this might be fine though since we consider NumOfHosts in the desired num workers of a type here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could add an e2e autoscaler test here https://github.com/ray-project/kuberay/tree/master/ray-operator/test/e2eautoscaler to verify the scale up/down behavior for both the V1 and V2 autoscaler in either this PR or a follow-up. That should probably be one of the requirements for moving this feature from alpha to beta.

@chiayi chiayi force-pushed the multihost-indexing branch 3 times, most recently from 45d53ae to bb602d5 Compare October 2, 2025 15:34
@ryanaoleary
Copy link
Collaborator

Once this is passing CI I think we can mark this as ready for review and ask other KubeRay contributors to review the feature.

@chiayi chiayi force-pushed the multihost-indexing branch 2 times, most recently from 5bebd86 to 1f21e83 Compare October 3, 2025 15:34
@chiayi chiayi marked this pull request as ready for review October 3, 2025 19:04
@chiayi chiayi force-pushed the multihost-indexing branch from 1f21e83 to 7419b56 Compare October 6, 2025 16:53
@chiayi chiayi changed the title [POC] Prototype multi-host indexing [TPU][RayCluster] Prototype multi-host indexing Oct 6, 2025
@andrewsykim andrewsykim changed the title [TPU][RayCluster] Prototype multi-host indexing [RayCluster] Prototype multi-host indexing Oct 6, 2025
return errstd.Join(utils.ErrFailedCreateWorkerPod, err)

// Worker creation path for multi-host indexing
if multihostIndexingEnabled {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic below seems pretty complicated, should we abstract it away in a util package?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we do since all it's really doing is going through and creating the remaining workers in groups. Abstracting it away in a util package will introduce another layer of indirection and I thought that might be unnecessary.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I don't think a util package is necessary, but I moved it to it's own reconciliation function in e9a8b23 because reconcilePods was really long and I wanted it to be really clear what's behind the feature gate.

@andrewsykim
Copy link
Member

I removed the "[TPU]" from the title, we should ensure this implementation is generic enough to also be used for GPUs. For e.g. labelling worker pods in an NVLink using GB200s

@chiayi chiayi force-pushed the multihost-indexing branch 2 times, most recently from daac294 to 786db45 Compare October 7, 2025 23:40
@andrewsykim andrewsykim changed the title [RayCluster] Prototype multi-host indexing [RayCluster] Add multi-host indexing labels Oct 14, 2025
@rueian rueian added the 1.5.0 label Oct 14, 2025
logger.Info("Uneven number of pods were found for multi-host worker groups",
"worker group", worker.GroupName,
)
return fmt.Errorf("%d pods were found, was expecting multiple of %d for group %s", len(headPods.Items), worker.NumOfHosts, worker.GroupName)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, if I delete a pod in the group manually, the reconciliation on the RayCluster will be stuck until I delete them all?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that can be ok as long as we have a clear message for users to follow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is just to check the case where multi-host indexing wasn't enabled at first and user deletes one of the pods in the replica groups and then enables multi-host indexing. If multi-host indexing is enabled the whole time, manually deleting one of the pods should trigger the whole replica group to be deleted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a, hopefully, clearer error msg.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, but where is the code that triggers the whole replica group to be deleted?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic is in this block:

// find all worker pods with the same replica group
replicaGrpName := workerPod.Labels[utils.RayWorkerReplicaIndexKey]
replicaGrpPodListToDelete := workerGrpReplicaMap[replicaGrpName]
// delete all pods within the group replica
for _, replicaGrpPod := range replicaGrpPodListToDelete {

when we're iterating through the worker pods to delete:

for _, workerPod := range workerPods.Items {
    shouldDelete, reason := shouldDeletePod(workerPod, rayv1.WorkerNode)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it guaranteed that when one pod is missing from the group, then the others will have shouldDelete=true?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No actually, I refactored the code and now we handle that case separately. There are three cases in which we scale down the entire replica group:

  1. 1 Pod is unhealthy (shouldDelete = True). If it's part of a group in the map, we delete all the Pods in that group with it, even if the other Pods are healthy.
  2. The Pod name is in WorkersToDelete. The same as the above, if the Pod to be deleted is in a multi-host group we add all the Pods in the same group with it to be deleted.
  3. diff < 0. This is the regular scale down, we calculate the number of groups to be deleted to satisfy the diff and then remove the entire multi-host group(s).

@chiayi chiayi force-pushed the multihost-indexing branch from 786db45 to 6939e9b Compare October 14, 2025 23:29
Signed-off-by: Aaron Liang <aaronliang@google.com>
@chiayi chiayi force-pushed the multihost-indexing branch from 6939e9b to 7a5659a Compare October 14, 2025 23:54
@Future-Outlier Future-Outlier self-assigned this Oct 15, 2025
@ryanaoleary
Copy link
Collaborator

cc: @rueian @Future-Outlier Aaron is OOO, but I have context on this PR and can work on fixing any comments and getting it merge ready in the meantime since I have access to push to it. Similar labelling logic was implemented in this GKE webhook: https://github.com/ai-on-gke/kuberay-tpu-webhook that I wrote, and similar deletion logic for replica groups in the Ray autoscaler in the code here.

This PR consolidates the logic to set host index and replica index in KubeRay when the Pods are created, and generalizes it away from being TPU specific since it can also support GPU use-cases. These labels will help observability when running multi-host workloads, as well as enable us to atomically scale down multi-host slices rather than forcing the user to manually clean up Pods when a workload hangs due to a single worker crashing.

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
@rueian
Copy link
Collaborator

rueian commented Oct 17, 2025

Hi @ryanaoleary, could you help to merge this with the master branch again? There is a fix for tests.

machichima and others added 2 commits October 17, 2025 20:42
* feat: add grace period after sumitter finished

* feat: generate new field in rayjob CRD

* fix: fix terminate logic + add default timeout value

* refactor: remove submitter finish timeout in rayjob status & use default timeout

* feat: get finish time from job and container

* test: for submitter finished timeout

* fix: set finishedAt to nil if submitter not finished

* test: ensure timeout close to the set timeout

* refactor: make status msg more readable

* fix: simplify finishedAt time format

* refactor: remove isSubmitterFinished and use finishedAt only

* fix: add LastTransitionTime to JobCondition in test

* Trigger CI

Signed-off-by: machichima <nary12321@gmail.com>

* refactor: fix lint and nit

* fix: fix test

---------

Signed-off-by: machichima <nary12321@gmail.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>
@ryanaoleary
Copy link
Collaborator

Hi @ryanaoleary, could you help to merge this with the master branch again? There is a fix for tests.

@rueian Done, should be up to date now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants