DynamoModel: Kubernetes Custom Resource to simplify Model Lifecycle Management #45
the DGD controller shouldn't launch the model job.
The new DynamoModel controller should be in charge of reconciling all DynamoModel resources and launching jobs for models.
The DGD reconciliation loop should requeue until the DynamoModel is ready.
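For the requeue-until-ready flow, the DGD reconciler would only need to gate on something like the following status shape (a hedged sketch; the condition and field names are assumptions, only the phases come from the proposal):

```yaml
# Illustrative DynamoModel status the DGD reconciler could wait on
status:
  phase: Ready                 # Pending | Downloading | Ready | Failed
  conditions:
    - type: Ready
      status: "True"
      reason: ModelDownloaded
      message: weights available on the backing storage
```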
agreed, I'll fix the diagram.
@biswapanda I think it would be worthwhile to at least do two thought experiments about how we'd use the new CR with the following constraints:
I think this would give us some insight into whether we're designing this well enough. Happy to expand on the above if needed.
### Storage Persistence
Downloaded model weights MUST be stored in Persistent Volume Claims (PVCs) that persist beyond the lifecycle of individual DGDs, enabling reuse across multiple deployments.

### Credential Management
For downloading from S3, it will not be via a Secret but via IAM (similar for other object stores), so we should be thoughtful about how this will work. It should not require someone to inject IAM credentials here; instead, credentials should come via the normal means (e.g. IRSA in AWS).
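For reference, the usual IRSA wiring looks roughly like this (illustrative names; how the proposed download Job would pick up the service account is an open question, not something defined here):

```yaml
# Illustrative only: IRSA-annotated service account instead of a static Secret
apiVersion: v1
kind: ServiceAccount
metadata:
  name: dynamo-model-downloader
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/model-bucket-reader
---
# The download Job's pod template would then simply reference it:
# serviceAccountName: dynamo-model-downloader
```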
### Status-Based Readiness
DynamoModel MUST expose a status field indicating readiness states (`Pending`, `Downloading`, `Ready`, `Failed`). Dependent resources (DGD, AIperf Job) SHOULD be able to wait for `Ready` state before proceeding.

### Storage Persistence
Per my overall comment, let's re-think whether this is a true requirement.
name: huggingface-token
key: token
# Storage configuration
storage:
I think it's worth a discussion on whether we want the DynamoModel to be in charge of setting up the PVC or not.
Another question: if a user deletes the DynamoModel CR, does the model in the PVC get deleted? What's the expected logic here?
Another: can multiple models be stored in a single PVC, or do we see this as a PVC per model?
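One way to make these choices explicit in the spec (every field name below is hypothetical, just to anchor the discussion):

```yaml
storage:
  # Option A: DynamoModel creates and owns the PVC
  createPVC: true
  size: 200Gi
  storageClassName: efs-sc           # an RWX class if shared across DGDs
  # Option B: reuse a pre-provisioned claim, possibly holding many models
  # existingClaim: shared-model-cache
  # subPath: llama-3-70b             # one directory per model inside the claim
  # Deletion semantics when the CR goes away
  retainOnDelete: true               # keep weights vs. garbage-collect them
```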
VllmPrefillWorker:
  modelRef:
    name: llama-3-70b-instruct-v1
    mountPath: /models  # Where to mount in container
I feel like this might be insufficient? For example, how will the launched worker know what the model key is to load and register in the MDC?
Also, I think it would be helpful to see how other systems do similar things and where they differ. Some that we can look at: AIBrix, Arks, OME. Adding a section on this would be valuable.
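For example, purely illustrative (none of these fields are in the quoted snippet), the ref might need to carry the served model name, or have the operator inject the resolved values:

```yaml
VllmPrefillWorker:
  modelRef:
    name: llama-3-70b-instruct-v1
    mountPath: /models
    # Hypothetical additions so the worker knows what to load/register:
    servedModelName: llama-3-70b-instruct      # key to register in the MDC
    # or resolved env vars injected by the operator, e.g.
    # DYNAMO_MODEL_PATH=/models/llama-3-70b
    # DYNAMO_MODEL_NAME=llama-3-70b-instruct
```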
# subPath: llama-3-70b

# Optional: Download configuration (defaults to HF Downloader or Base Dynamo image with HF)
downloader:
we might want to consider securityContext in this spec to make this Job work on OpenShift-style envs, so that you can run as a non-root user. Not a blocking issue.
yes, good point - we can add extraPodSpec to allow users to set security contexts, tolerations, etc.
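Something roughly like this (sketch only; the exact extraPodSpec shape here is an assumption):

```yaml
downloader:
  extraPodSpec:
    securityContext:
      runAsNonRoot: true
      runAsUser: 1000
      fsGroup: 1000
    tolerations:
      - key: dedicated
        operator: Equal
        value: model-download
        effect: NoSchedule
```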
are we considering checksums in the scope of this? np if not in scope
overall LGTM, great proposal @biswapanda. Left comments above; suggest we think strongly about the dependency on RWX.
LGTM as well overall. We should definitely persist this metadata in CRDs when Dynamo is running on k8s. Already mentioned to you @biswapanda, but wanted to gauge others' opinions here as well: I am wondering if we should have model express persist this metadata too, since it is maintaining a dedicated database outside of etcd. This would decouple the dependency on running Dynamo on k8s (and having etcd) in order to have access to this metadata.
Currently, Dynamo users face three critical challenges:

1. **Model Version Drift**: Inconsistent behavior occurs when AI-perf benchmarks use different model versions than deployments. This was observed during 70B model benchmarking where the deployment used stale weights while the benchmark job pulled the latest commit from HuggingFace.
question: when we say the benchmark job used the latest - is this referencing, for example, the tokenizer for aiperf being mismatched with the weights of the deployment? Not sure I follow what 'drift' means here - maybe 'mismatch' is the better term, where different components can have different versions of the model?
yes, I meant an unintentional version mismatch from pointing to the main revision of a HuggingFace Hub model. There were tokenizer/config JSON changes while the weights remained the same. In this case aiperf was an ephemeral job running on main ToT while the deployment was an older snapshot of main.
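Pinning an exact revision in the CR is what would make both consumers resolve identical artifacts; a minimal sketch (field names assumed, not final):

```yaml
source:
  huggingface:
    repo: meta-llama/Meta-Llama-3-70B-Instruct   # illustrative
    revision: "<commit-sha>"                     # pin a commit instead of "main"
```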
2. **No Cross-Deployment/perf job Model Reuse**: Multiple DGDs or aiperf jobs cannot easily share the same model weights, leading to duplicated operational overhead managing PVCs, secrets, and Jobs.
question: does this mean that multiple deployments can't share a PVC or other shared storage for weights? would model express solve this?
yes, there is a path to use model express. But overall the k8s spec is a contract/interface, and the DynamoModel CRD is orthogonal - it works with modelexpress. Users can:
- enable model express
- bring their own model registry (some folks used JFrog, MLflow, etc.)
- directly download from HF

Added an additional feature for model validation: we need a mechanism to generate checksums for a commit. Currently HuggingFace doesn't provide this OOTB.
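A hedged sketch of how those options could surface in the spec (all field names illustrative):

```yaml
source:
  # exactly one backend, depending on where the weights live
  huggingface:
    repo: <org>/<model>
    revision: "<commit-sha>"
  # modelExpress:
  #   model: <registered-model-name>
  # registry:
  #   uri: s3://my-bucket/models/llama-3-70b     # JFrog, MLflow, object store, ...
  # optional integrity check, since HF doesn't provide per-commit checksums OOTB
  # checksum:
  #   sha256: <digest>
```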
This proposal introduces `DynamoModel`, a dedicated Kubernetes Custom Resource (CR) for managing model lifecycle in the Dynamo ecosystem. DynamoModel decouples model downloading, versioning, and caching from DynamoGraphDeployment (DGD), enabling consistent model references across deployments, benchmarks, and services while eliminating boilerplate code and preventing model version drift.
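As a rough end-to-end illustration (API group/version and field names are assumptions layered on the snippets quoted above, not the final spec):

```yaml
apiVersion: nvidia.com/v1alpha1          # assumed group/version
kind: DynamoModel
metadata:
  name: llama-3-70b-instruct-v1
spec:
  source:
    huggingface:
      repo: meta-llama/Meta-Llama-3-70B-Instruct
      revision: "<commit-sha>"           # pinned so DGD and aiperf see identical bits
  storage:
    size: 200Gi
---
# Both the DGD worker and the aiperf Job would reference the same CR:
#   modelRef:
#     name: llama-3-70b-instruct-v1
#     mountPath: /models
```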