-
Notifications
You must be signed in to change notification settings - Fork 41
ModelServing Role Support LeaderWorkerSet #609
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ModelServing Role Support LeaderWorkerSet #609
Conversation
Summary of ChangesHello @zhiweideren, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request proposes an enhancement to Kthena ModelServing by integrating LeaderWorkerSet (LWS) support. The primary goal is to enable ModelServing to effectively orchestrate leader/worker style "super-pod" workloads, benefiting from LWS's superior lifecycle management, coordinated restarts, and stable identity. The proposal evaluates various approaches and recommends "Extended Role Configuration," which allows specific ModelServing roles to be optionally backed by LWS resources, ensuring existing ModelServing routing and autoscaling integrations remain functional. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a design proposal for adding LeaderWorkerSet (LWS) support to Kthena ModelServing. The proposal is well-structured, comprehensive, and clearly evaluates multiple solutions before recommending one. The recommended approach (Solution 3: Extended Role Configuration) seems sound and balances user experience with implementation complexity.
My feedback focuses on adding more specific details to the proposed API changes and clarifying the behavior for certain lifecycle operations, such as migrating a role between being LWS-backed and natively managed. These clarifications will help solidify the design and prevent potential user confusion.
cc62948 to
276da2a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR introduces LeaderWorkerSet (LWS) role support for Kthena ModelServing, enabling users to submit LeaderWorkerSet Custom Resources that are automatically converted to ModelServing internal resources. The implementation creates a new LWS controller that watches LeaderWorkerSet CRs and maps them to ModelServing resources without requiring the native LWS controller to be deployed.
Key Changes
- Adds a new LWS controller that translates LeaderWorkerSet specifications to ModelServing resources
- Integrates LWS controller startup into the main controller lifecycle
- Includes comprehensive proposal documentation explaining the architecture and mapping rules
Reviewed changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 13 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/model-serving-controller/controller/lws_controller.go | New controller implementing LWS to ModelServing conversion and reconciliation logic |
| pkg/controller/controller.go | Integration point for starting the LWS controller when ModelServing controller is enabled |
| docs/proposal/modelserving-role-support-leaderworkerset.md | Detailed proposal document describing architecture, field mappings, and implementation requirements |
| go.mod | Adds sigs.k8s.io/lws v0.7.0 dependency and indirect dependencies |
| go.sum | Checksum entries for new dependencies |
| licenses/sigs.k8s.io/lws/LICENSE | Apache 2.0 license for LWS dependency |
| licenses/golang.org/x/sync/LICENSE | BSD license for golang.org/x/sync dependency |
| licenses/github.com/google/btree/LICENSE | Apache 2.0 license for btree dependency |
| licenses/github.com/fsnotify/fsnotify/LICENSE | BSD license for fsnotify dependency |
| docs/kthena/docs/reference/helm-chart-values.md | Minor formatting change (trailing whitespace) |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| workerReplicas = 0 | ||
| } | ||
|
|
||
| roleReplicas := int32(1) |
Copilot
AI
Jan 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential semantic mismatch between proposal and implementation. The proposal document (section 4.1, line 88) states that spec.replicas should map to ModelServing replicas, defining "the number of replicas for ServingGroup". However, the implementation sets roleReplicas to a hardcoded value of 1 (line 212) instead of using lws.Spec.Replicas. This means the LWS replicas are being used for ModelServing.Spec.Replicas (number of groups), while each role within a group always has exactly 1 replica. Verify if this mapping is correct according to the LWS semantics, as it may not align with user expectations from the LWS API.
40f82b2 to
509fb04
Compare
…oller Signed-off-by: liuhaiyu <liuhaiyu13@huawei.com>
…oller Signed-off-by: liuhaiyu <liuhaiyu13@huawei.com>
509fb04 to
1a59635
Compare
…oller Signed-off-by: liuhaiyu <liuhaiyu13@huawei.com>
1a59635 to
b6e3c59
Compare
charts/kthena/charts/workload/templates/kthena-controller-manager/rbac/cluster-role.yaml
Outdated
Show resolved
Hide resolved
charts/kthena/charts/workload/templates/kthena-controller-manager/rbac/cluster-role.yaml
Outdated
Show resolved
Hide resolved
| sigs.k8s.io/structured-merge-diff/v6 v6.3.0 | ||
| sigs.k8s.io/yaml v1.6.0 | ||
| volcano.sh/apis v0.0.0-20251222111824-40fed4793a6f | ||
| volcano.sh/apis v1.13.2-0.20260105020123-deb066235db8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@LiZhenCheng9527 Please help confirm whether the subgroup supported in in this release, otherwise should revert this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Support subgroup in this release
| } | ||
|
|
||
| lwsInformerFactory.Start(ctx.Done()) | ||
| kthenaInformerFactory.Start(ctx.Done()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think start here maybe unsafe, since we have also started it in other controller. We should make sure all the resource event handler registered before informer started, otherwise we may miss some resource events
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The NewSharedInformerFactory and start were split to ensure that all event handlers were registered before all informer started
|
@zhiweideren We should also add some test cases covering the translation, especially note the corner case |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
…odelServing Signed-off-by: liuhaiyu <liuhaiyu13@huawei.com>
Signed-off-by: liuhaiyu <liuhaiyu13@huawei.com>
…lServing Signed-off-by: liuhaiyu <liuhaiyu13@huawei.com>
done.add lws_controller_test.go |
hzxuzhonghu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: now we create separate informers each controller, which could be shared.
…ttings in LWS controller Signed-off-by: liuhaiyu <liuhaiyu13@huawei.com>
|
/lgtm @zl-cheng @LiZhenCheng9527 @git-malu take another look? |
…elsering_support_lws_issuer_407 # Conflicts: # charts/kthena/charts/workload/templates/kthena-controller-manager/rbac/cluster-role.yaml
|
Adding label DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
/lgtm |
|
/approve |
1 similar comment
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: hzxuzhonghu The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: hzxuzhonghu The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What type of PR is this?
enhancement
What this PR does / why we need it:
This proposal introduces LeaderWorkerSet (LWS) role support for Kthena ModelServing. The goal is to enable ModelServing to orchestrate leader/worker style “super-pod” workloads via LWS where it provides better lifecycle semantics (group-level rollout, coordinated restart behaviors, stable identity and networking patterns), while preserving the existing ModelServing user experience and integrations (ModelServer/ModelRoute selection and autoscaling).
Which issue(s) this PR fixes:
Fixes #407
Special notes for your reviewer:
Determine which solution to use
Does this PR introduce a user-facing change?: