@kash2104
Contributor

@kash2104 kash2104 commented Dec 23, 2025

Why are these changes needed?

These changes run the replica validation before RayCluster pod creation. The PR moves the replica validation logic from utils.go to validation.go, removes the now-redundant tests, and adds unit tests for the moved logic.

Related issue number

Closes #4101

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@JiangJiaWei1103
Contributor

Thanks for your effort.

cc @Future-Outlier
This PR duplicates #4116, which seems inactive now. Should we close #4116 and work on this one?

@kash2104
Contributor Author

kash2104 commented Dec 24, 2025

I went through #4116, but it has been inactive, with no changes made since the volcano PR was merged. Many changes are required now that the volcano PR is merged, which is why I opened this PR.

@JiangJiaWei1103
Contributor

Gotcha. Let's wait for maintainers' reply. I'll help review, thank you.

@kash2104
Contributor Author

kash2104 commented Jan 1, 2026

@Future-Outlier Just wanted to follow up - do you prefer closing #4116 and continuing the work in this PR?

@Future-Outlier
Member

@Future-Outlier Just wanted to follow up - do you prefer closing #4116 and continuing the work in this PR?

Hi @kash2104, I just left a comment on #4116 asking her whether she has time to finish the work.
If she hasn't replied after 2 weeks, you can take it over.

Member

@Future-Outlier Future-Outlier left a comment

@Future-Outlier
Member

cursor review

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

	if *workerGroup.MinReplicas > *workerGroup.MaxReplicas {
		return fmt.Errorf("worker group %s has minReplicas %d greater than maxReplicas %d", workerGroup.GroupName, *workerGroup.MinReplicas, *workerGroup.MaxReplicas)
	}
}
Collaborator

Should we also check whether workerGroup.Replicas lies within workerGroup.MinReplicas and workerGroup.MaxReplicas?

Contributor Author

@kash2104 kash2104 Jan 18, 2026

@machichima Here in validation.go, we only check that the min and max replica values are not incorrect or impossible before pod creation happens; the actual logic for the number of replicas is handled in util.go.

So I think we don't need this check in validation.go.
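
To make the distinction concrete, here is a minimal sketch of a bounds-only check, assuming a reduced worker group type. The names `workerGroup`, `validateReplicaBounds`, and `int32Ptr` are hypothetical stand-ins, not the actual KubeRay code, and the nil handling is an assumption. The point is that a `Replicas` value outside the bounds still passes this check.

```go
package main

import "fmt"

// workerGroup is a hypothetical stand-in for rayv1.WorkerGroupSpec,
// reduced to the fields discussed in this thread.
type workerGroup struct {
	GroupName   string
	Replicas    *int32
	MinReplicas *int32
	MaxReplicas *int32
}

// int32Ptr is a small helper for building pointer fields in examples.
func int32Ptr(v int32) *int32 { return &v }

// validateReplicaBounds rejects bounds that can never be satisfied.
// It deliberately does not compare Replicas against the bounds.
func validateReplicaBounds(wg workerGroup) error {
	// Assumption: unset bounds are defaulted elsewhere, so skip them here.
	if wg.MinReplicas == nil || wg.MaxReplicas == nil {
		return nil
	}
	if *wg.MinReplicas > *wg.MaxReplicas {
		return fmt.Errorf("worker group %s has minReplicas %d greater than maxReplicas %d",
			wg.GroupName, *wg.MinReplicas, *wg.MaxReplicas)
	}
	return nil
}

func main() {
	// Impossible bounds: minReplicas (5) exceeds maxReplicas (2).
	bad := workerGroup{GroupName: "worker-group-1", MinReplicas: int32Ptr(5), MaxReplicas: int32Ptr(2)}
	fmt.Println(validateReplicaBounds(bad))

	// Replicas below minReplicas still passes, because only the bounds are validated.
	low := workerGroup{GroupName: "worker-group-3", Replicas: int32Ptr(1), MinReplicas: int32Ptr(2), MaxReplicas: int32Ptr(5)}
	fmt.Println(validateReplicaBounds(low))
}
```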

Comment on lines 431 to 433
if nodeGroup.MinReplicas != nil {
	minReplicas = *nodeGroup.MinReplicas
}
Collaborator

Maybe we can use ptr.Deref, which is used in other places as well

ptr.Deref(rayServiceInstance.Status.ActiveServiceStatus.TargetCapacity, -1) == 0 &&
ptr.Deref(rayServiceInstance.Status.PendingServiceStatus.TrafficRoutedPercent, -1) == 100
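
As a rough illustration of the suggestion: `ptr.Deref` from `k8s.io/utils/ptr` returns the pointed-to value, or a supplied default when the pointer is nil. The sketch below uses a local `deref` helper that mirrors that behavior so it runs without the dependency. The default of 0 for an unset `minReplicas` is an assumption for illustration, not taken from the PR.

```go
package main

import "fmt"

// deref mirrors the behavior of ptr.Deref from k8s.io/utils/ptr:
// it returns *p when p is non-nil, and def otherwise.
func deref[T any](p *T, def T) T {
	if p != nil {
		return *p
	}
	return def
}

func main() {
	var unset *int32
	set := int32(3)

	// Replaces the explicit nil check shown in the diff.
	// Assumption: 0 is the fallback when minReplicas is unset.
	fmt.Println(deref(unset, int32(0)))
	fmt.Println(deref(&set, int32(0)))
}
```

In the real code this would be a single call such as `ptr.Deref(nodeGroup.MinReplicas, <default>)`, with the default chosen to match the existing behavior.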

1. Remove validation logic from GetWorkerGroupDesiredReplicas (utils.go)
and add this logic to ValidateRayClusterSpec (validation.go).
2. Remove unnecessary tests from TestGetWorkerGroupDesiredReplicas.
3. Remove the unused ctx.
This is added since we moved the validation logic.
Comment on lines -575 to -583
// Test 3: `WorkerGroupSpec.Replicas` is not nil but is more than maxReplicas.
replicas = int32(6)
workerGroupSpec.Replicas = &replicas
assert.Equal(t, GetWorkerGroupDesiredReplicas(ctx, workerGroupSpec), maxReplicas)

// Test 4: `WorkerGroupSpec.Replicas` is not nil but is less than minReplicas.
replicas = int32(0)
workerGroupSpec.Replicas = &replicas
assert.Equal(t, GetWorkerGroupDesiredReplicas(ctx, workerGroupSpec), minReplicas)
Collaborator

Should we keep those checks? I think this part is not moved to validation

Contributor Author

The part we moved to validation checks whether the desired number of replicas is within the correct bounds after the pods have been created. These tests exercised that logic through GetWorkerGroupDesiredReplicas; since we moved the logic to validation.go, we added the corresponding tests in validation_test.go as well.

Thus these checks have been removed from util_test.go.

Collaborator

Please correct me if I'm wrong: I think in validation we only check whether maxReplicas and minReplicas are set correctly (comment: #4307 (comment)). The tests here check whether the replicas are clamped to the max or min replicas, so I think we should keep them?
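
For reference, the clamping behavior those removed tests cover can be sketched as follows. `desiredReplicas` is a hypothetical, simplified stand-in for the clamping inside GetWorkerGroupDesiredReplicas, not the real function. The bounds of 1 and 5 and the fallback to minReplicas for a nil `replicas` are assumptions chosen to match the values 6 and 0 in the removed tests.

```go
package main

import "fmt"

// desiredReplicas is a hypothetical, simplified stand-in for the clamping
// done in GetWorkerGroupDesiredReplicas. It keeps replicas within
// [minReplicas, maxReplicas].
func desiredReplicas(replicas *int32, minReplicas, maxReplicas int32) int32 {
	// Assumption: an unset replicas falls back to minReplicas.
	if replicas == nil {
		return minReplicas
	}
	switch {
	case *replicas < minReplicas:
		return minReplicas
	case *replicas > maxReplicas:
		return maxReplicas
	default:
		return *replicas
	}
}

func main() {
	// Assumed bounds for illustration.
	minReplicas, maxReplicas := int32(1), int32(5)

	over := int32(6)  // more than maxReplicas, as in removed Test 3
	under := int32(0) // less than minReplicas, as in removed Test 4

	fmt.Println(desiredReplicas(&over, minReplicas, maxReplicas))
	fmt.Println(desiredReplicas(&under, minReplicas, maxReplicas))
}
```

Because this behavior is separate from the minReplicas/maxReplicas consistency check, tests for it would not be covered by validation_test.go alone.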

Comment on lines +2518 to +2534
{
	name: "replicas smaller than minReplicas when autoscaling disabled",
	spec: func() rayv1.RayClusterSpec {
		s := createSpec()
		s.WorkerGroupSpecs = []rayv1.WorkerGroupSpec{
			{
				GroupName:   "worker-group-3",
				Template:    podTemplateSpec(nil, nil),
				Replicas:    ptr.To(int32(1)),
				MinReplicas: ptr.To(int32(2)),
				MaxReplicas: ptr.To(int32(5)),
			},
		}
		return s
	}(),
	expectError: false,
},
Collaborator

I think this will not raise an error because we clamp it to minReplicas in util.go? Could we please add a comment or update the test name to briefly explain why this does not raise an error? Usually we would expect an error to be raised in this case.

Successfully merging this pull request may close these issues.

Move Replica validation to validation.go

5 participants