Contributor

@domsolutions domsolutions commented Jan 6, 2026

Motivation

To provide high availability (HA) for the scheduler, we need to back its state with a DB. Currently the memory store keeps models and servers in structs, which are updated via pointers. A DB can't be updated via pointers, so we need to abstract out the storage calls and make explicit update calls where necessary.

This PR implements an in-memory store for the new memory interface, as we still wish to support non-HA deployments for customers who don't want to use HA.
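As a rough illustration of the motivation above (not the code in this PR), the sketch below shows why pointer-style mutation stops working once callers receive copies, and why an explicit update call is needed. `Model`, `ModelStore`, and `memoryStore` are hypothetical names chosen for the example; the real interface and proto types may look quite different.

```go
package main

import (
	"fmt"
	"sync"
)

// Model is a hypothetical stand-in for a stored model record.
type Model struct {
	Name     string
	Replicas int
}

// ModelStore is a hypothetical storage interface: Get returns a copy,
// and changes are only persisted through an explicit Update call.
type ModelStore interface {
	Get(name string) (Model, bool)
	Update(m Model)
}

// memoryStore is an in-memory ModelStore, as might be used when HA is off.
type memoryStore struct {
	mu     sync.RWMutex
	models map[string]Model
}

func newMemoryStore() *memoryStore {
	return &memoryStore{models: make(map[string]Model)}
}

func (s *memoryStore) Get(name string) (Model, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	m, ok := s.models[name]
	return m, ok
}

func (s *memoryStore) Update(m Model) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.models[m.Name] = m
}

func main() {
	var store ModelStore = newMemoryStore()
	store.Update(Model{Name: "iris", Replicas: 1})

	m, _ := store.Get("iris")
	m.Replicas = 3 // changes only the caller's copy

	stale, _ := store.Get("iris")
	fmt.Println(stale.Replicas) // prints 1: the store has not been updated

	store.Update(m) // explicit write-back, as a DB-backed store would need
	fresh, _ := store.Get("iris")
	fmt.Println(fresh.Replicas) // prints 3
}
```

A DB-backed implementation could satisfy the same interface, which is what would let the HA and non-HA paths share the calling code.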

Summary of changes

  • new proto definition for the protos we'll be storing in the DB: apis/mlops/scheduler/db/db.proto
  • integrated the current storage we have for pipelines/experiments into the above proto
  • added proto method helpers for models/servers in apis/go/mlops/scheduler/db/model_ext.go and apis/go/mlops/scheduler/db/server_ext.go; these are essentially a copy/paste of the methods we had on ServerSnapshot and ModelVersion, which we stored in the map
  • added update calls where needed within the store
  • large updates throughout codebase to use the new helpers
  • additionally introduced a LockServer on the store for server operations. This may not be necessary: I noticed some races that I thought could be caused by server data being overwritten, but the lock didn't help. Can remove if we decide it's not necessary.
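
For readers unfamiliar with the idea behind the LockServer bullet, here is a minimal sketch of a per-server lock guarding a read-modify-write. It is an assumption about the general technique, not the implementation in this PR: `serverLocks`, `newServerLocks`, and `bumpConcurrently` are hypothetical names, and the `LockServer` signature shown here is invented for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// serverLocks hands out one mutex per server name (hypothetical type).
type serverLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newServerLocks() *serverLocks {
	return &serverLocks{locks: make(map[string]*sync.Mutex)}
}

// LockServer blocks until the named server's lock is held and returns
// the function that releases it.
func (l *serverLocks) LockServer(name string) (unlock func()) {
	l.mu.Lock()
	m, ok := l.locks[name]
	if !ok {
		m = &sync.Mutex{}
		l.locks[name] = m
	}
	l.mu.Unlock()

	m.Lock()
	return m.Unlock
}

// bumpConcurrently performs n concurrent read-modify-write updates on
// counter, each one serialized by the lock for the given server.
func bumpConcurrently(l *serverLocks, server string, counter *int, n int) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			unlock := l.LockServer(server)
			defer unlock()
			*counter++ // safe: only one goroutine holds this server's lock
		}()
	}
	wg.Wait()
}

func main() {
	locks := newServerLocks()
	replicas := 0
	bumpConcurrently(locks, "server-a", &replicas, 100)
	fmt.Println(replicas) // prints 100
}
```

A lock like this only helps when every writer to a server's record goes through it, which may be why it did not fix the races mentioned above.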

Checklist

  • Added/updated unit tests
  • Added/updated documentation
  • Checked for typos in variable names, comments, etc.
  • Added licences for new files

Testing

Member

lc525 commented Jan 7, 2026

Testing the newly enabled Codex integration here; let's see how noisy or useful it gets. We haven't yet set up an AGENTS.md to guide the reviewing process in accordance with our preferences, and (just to be clear) this does not replace the need for human reviewing/ownership:

@codex review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: aeb9b2db81


if err != nil {
	return nil, coordinator.ServerEventMsg{}, err
}
modelVersion.Replicas[int32(request.ReplicaIdx)] = &db.ReplicaStatus{State: db.ModelReplicaState_Loaded}


P1: Initialize replica map before assignment

If a model version is loaded from persistent storage with no replicas yet, the proto map field will be nil when unmarshaled. In that case, assigning directly into modelVersion.Replicas[...] will panic during agent subscribe (e.g., a server reconnecting with loaded models that were stored without replica entries). Consider using SetReplicaState or explicitly initializing the map before assignment to avoid nil-map writes in HA/DB-backed scenarios.
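
To make the concern concrete, here is a small standalone sketch of the Go behaviour in question: assigning into a nil map panics, while a helper that initialises the map first does not. `ModelVersion` and `ReplicaStatus` below are simplified hypothetical stand-ins for the generated proto types, and `setReplicaState` is only an assumption about what a guard like SetReplicaState might do. Whether the nil case can actually arise on this code path is a separate question.

```go
package main

import "fmt"

// ReplicaStatus and ModelVersion are simplified, hypothetical stand-ins
// for the generated proto types discussed in this review.
type ReplicaStatus struct {
	State string
}

type ModelVersion struct {
	Replicas map[int32]*ReplicaStatus
}

// setReplicaState initialises the map if needed before writing to it.
func setReplicaState(mv *ModelVersion, idx int32, state string) {
	if mv.Replicas == nil {
		mv.Replicas = make(map[int32]*ReplicaStatus)
	}
	mv.Replicas[idx] = &ReplicaStatus{State: state}
}

// directAssignPanics writes into the map without a nil check and reports
// whether that write panicked.
func directAssignPanics(mv *ModelVersion, idx int32) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	mv.Replicas[idx] = &ReplicaStatus{State: "Loaded"}
	return false
}

func main() {
	// A zero-value ModelVersion has a nil Replicas map.
	fmt.Println(directAssignPanics(&ModelVersion{}, 0)) // prints true

	mv := &ModelVersion{}
	setReplicaState(mv, 0, "Loaded")
	fmt.Println(mv.Replicas[0].State) // prints Loaded
}
```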

Member


Bogus comment. If a model shows up in the list of models that an agent declares it has loaded, it must by definition have at least the replica on that agent. If the scheduler didn't already know about the model, we initialise the replicas map on line 95.

Might be worth checking how we represent things in the case of a model with zero replicas, but any problems would not show up here.

