-
Notifications
You must be signed in to change notification settings - Fork 855
feat(scheduler): abstract model/server storage #7070
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: v2
Are you sure you want to change the base?
Conversation
|
Testing the newly enabled codex integration here, let's see how noisy or useful it gets. We haven't yet set an AGENTS.md to guide any of the reviewing process in accordance with out preferences, and (just to be clear) this does not replace the need for human reviewing/ownership: @codex review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: aeb9b2db81
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
scheduler/pkg/store/memory.go
Outdated
| if err != nil { | ||
| return nil, coordinator.ServerEventMsg{}, err | ||
| } | ||
| modelVersion.Replicas[int32(request.ReplicaIdx)] = &db.ReplicaStatus{State: db.ModelReplicaState_Loaded} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initialize replica map before assignment
If a model version is loaded from persistent storage with no replicas yet, the proto map field will be nil when unmarshaled. In that case, assigning directly into modelVersion.Replicas[...] will panic during agent subscribe (e.g., a server reconnecting with loaded models that were stored without replica entries). Consider using SetReplicaState or explicitly initializing the map before assignment to avoid nil-map writes in HA/DB-backed scenarios.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bogus comment. If a model shows up in the list of models that an agent declares it has loaded, it must by definition have at least the replica on that agent. If the scheduler didn't know about it, we are initialising the replicas map on line 95
Might be worth checking how we represent things in the case of a model with zero replicas, but any problems would not show up here.
aeb9b2d to
380e3ba
Compare
…rive correct statuses of model/replicas
… uint32 to int32 conversion
…icas. Added proto for DB dump endpoint to aid debugging
b387770 to
14fc906
Compare
Motivation
In order to provide HA for the scheduler, we need to use a DB. Currently the memory store for models and servers uses a struct, and they are updated via pointers. Obviously we can't update the DB via pointers, so we need to abstract out the storage calls and make
updatecalls where necessary.This PR imlements an in-memory store for the new memory interface, as we still wish to support non-HA if customers don't wish to use it.
Summary of changes
apis/mlops/scheduler/db/db.protoapis/go/mlops/scheduler/db/model_ext.goandapis/go/mlops/scheduler/db/server_ext.gothese are essentially a copy/paste of the methods we had onServerSnapshotandModelVersionwhich we stored in the mapLockServeron the store when doing server operations, this may not be necessary, as possibly noticed some races which I thought could be the issue due to server data being overwritten, but it didn't help. Can remove if we decide not necessary.Checklist
Testing