@400Ping
Contributor

@400Ping 400Ping commented Oct 24, 2025

Why are these changes needed?

When using sidecar mode, the head pod should not be recreated after it is deleted. The RayJob should be marked as Failed.

Related issue number

Closes #4130

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Signed-off-by: 400Ping <fourhundredping@gmail.com>
@400Ping 400Ping marked this pull request as draft October 24, 2025 16:51
@400Ping 400Ping marked this pull request as ready for review October 24, 2025 17:35
Signed-off-by: 400Ping <fourhundredping@gmail.com>
@rueian
Collaborator

rueian commented Oct 24, 2025

@400Ping, the change should be made in the raycluster_controller. We need to make the raycluster_controller not recreate the head pod if the cluster belongs to a RayJob, so that we can avoid races where the raycluster_controller recreates the head before the rayjob_controller checks it.

@400Ping 400Ping marked this pull request as draft October 25, 2025 00:36
@400Ping
Contributor Author

400Ping commented Oct 25, 2025

> @400Ping, the change should be made in the raycluster_controller. We need to make the raycluster_controller not recreate the head pod if the cluster belongs to a RayJob, so that we can avoid races where the raycluster_controller recreates the head before the rayjob_controller checks it.

ok, thanks

Signed-off-by: 400Ping <fourhundredping@gmail.com>
Signed-off-by: 400Ping <fourhundredping@gmail.com>
originatedFrom := utils.GetCRDType(instance.Labels[utils.RayOriginatedFromCRDLabelKey])
if originatedFrom == utils.RayJobCRD {
	logger.Info(
		"reconcilePods: Found 0 head Pods for a RayJob-managed RayCluster; skipping head creation to let RayJob controller handle the failure",
Collaborator

Won't this prevent any head pod from being created at all? We still need to create the first head pod. I think you can check the RayClusterProvisioned condition to decide whether to create one or not.
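
The suggestion above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual KubeRay code: `Cluster`, `Condition`, `isProvisioned`, and `shouldCreateHead` are hypothetical stand-ins, and the string values `"RayJob"` and `"RayClusterProvisioned"` are assumed to match the real label value and condition type. The idea is that a missing head pod is created unless the RayCluster originated from a RayJob and has already been provisioned once.

```go
package main

import "fmt"

// Condition is a simplified, hypothetical stand-in for a Kubernetes status condition.
type Condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
}

// Cluster is a hypothetical, trimmed-down view of a RayCluster.
type Cluster struct {
	OriginatedFrom string // assumed value of the originated-from label, e.g. "RayJob"
	Conditions     []Condition
}

const (
	originRayJob    = "RayJob"                // assumption: label value for RayJob-owned clusters
	condProvisioned = "RayClusterProvisioned" // condition named in the review comment
)

// isProvisioned reports whether the provisioned condition is present and true.
func isProvisioned(c Cluster) bool {
	for _, cond := range c.Conditions {
		if cond.Type == condProvisioned && cond.Status == "True" {
			return true
		}
	}
	return false
}

// shouldCreateHead decides whether to create a head Pod given how many exist.
// The first head Pod is always created. A head Pod that disappears after a
// RayJob-owned cluster was provisioned is not recreated, so the RayJob
// controller can observe the loss and mark the job as failed.
func shouldCreateHead(c Cluster, headPods int) bool {
	if headPods > 0 {
		return false
	}
	if c.OriginatedFrom == originRayJob && isProvisioned(c) {
		return false
	}
	return true
}

func main() {
	provisioned := []Condition{{Type: condProvisioned, Status: "True"}}

	fresh := Cluster{OriginatedFrom: originRayJob}
	lostHead := Cluster{OriginatedFrom: originRayJob, Conditions: provisioned}
	standalone := Cluster{Conditions: provisioned}

	fmt.Println(shouldCreateHead(fresh, 0))      // true: first head Pod of a RayJob cluster
	fmt.Println(shouldCreateHead(lostHead, 0))   // false: head was deleted after provisioning
	fmt.Println(shouldCreateHead(standalone, 0)) // true: standalone clusters still recover
}
```

Keeping the decision in one pure function would make it straightforward to unit test separately from the reconcile loop; whether the real check should also be limited to sidecar submission mode is left open here.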

Member

@Future-Outlier Future-Outlier left a comment

Let's try to fix this; it's super important.

Development

Successfully merging this pull request may close these issues.

[Bug] Sidecar mode shouldn't restart head pod when head pod is deleted

3 participants