Binary file added model-crd/dyn_model_lifecycle.png
Binary file added model-crd/dynamo-dgd-lifecycle.png

> **Comment (on `dynamo-dgd-lifecycle.png`):** The DGD controller shouldn't launch the model job. The new DynamoModel controller should be in charge of reconciling all DynamoModel resources and launching jobs for models. The DGD reconciliation loop should requeue until the DynamoModel is ready.

> **Author:** Agreed, I'll fix the diagram.

342 changes: 342 additions & 0 deletions model-crd/dynamo-model-crd.md
# DynamoModel: Kubernetes Custom Resource to simplify Model Lifecycle Management UX

**Status**: In-review

**Authors**: [biswapanda](https://github.com/biswapanda)

**Category**: Architecture

**Required Reviewers**: [Maksim, Itay, Anish, Ganesh, Neelay, Kavin]

**Review Date**: [targeted: Oct 9, 2025]


**Slack thread**: [link](https://nvidia.slack.com/archives/C06850J381Y/p1758647954211439?thread_ts=1758613631.539669&cid=C06850J381Y)

# Summary

This proposal introduces `DynamoModel`, a dedicated Kubernetes Custom Resource (CR) for managing model lifecycle in the Dynamo ecosystem. DynamoModel decouples model downloading, versioning, and caching from DynamoGraphDeployment (DGD), enabling consistent model references across deployments, benchmarks, and services while eliminating boilerplate code and preventing model version drift.

# Motivation

Currently, Dynamo users face three critical challenges:

1. **Model Version Drift**: Inconsistent behavior occurs when AI-perf benchmarks use different model versions than deployments. This was observed during 70B model benchmarking where the deployment used stale weights while the benchmark job pulled the latest commit from HuggingFace.
> **Contributor:** Question: when we say the benchmark job used the latest, does this mean, for example, that the tokenizer used by aiperf mismatched the weights of the deployment? Not sure I follow what "drift" means here; maybe "mismatch" is the term, where different components can have different versions of the model?

> **Author:** Yes, I meant an unintentional version mismatch from pointing at the `main` revision of a HuggingFace Hub model. There were tokenizer/config JSON changes while the weights remained the same. In this case, aiperf was an ephemeral job running on `main` ToT, and the deployment was an older snapshot of `main` of the HuggingFace Hub model.

2. **No Model Reuse Across Deployments and Perf Jobs**: Multiple DGDs or aiperf jobs cannot easily share the same model weights, leading to duplicated operational overhead managing PVCs, secrets, and Jobs.
> **Contributor:** Question: does this mean that multiple deployments can't share a PVC or other shared storage for weights? Would Model Express solve this?

> **Author:** Yes, there is a path to use Model Express, but the k8s spec is a contract/interface. The DynamoModel CRD is orthogonal to Model Express and works with it; users can:
> - enable Model Express
> - bring their own model registry (some folks used JFrog, MLflow, etc.)
> - directly download from HF

3. **Boilerplate Code**: Each deployment requires *manual* setup of PVCs, secrets, and Jobs to download models before starting DGD, adding complexity and maintenance burden.

These issues stem from tightly coupling model management with deployment lifecycle, making it difficult to:
- Pin specific model versions across the ecosystem
- Share models between multiple deployments and benchmarks
- Verify model weights readiness before starting workers (currently done manually by users)

## Goals

- Decouple model lifecycle from DynamoGraphDeployment lifecycle
- Enable model version pinning and eliminate version drift
- Provide model sharing across multiple DynamoGraphDeployments and aiperf jobs
- Simplify model download operations through operator-managed automation
- Ensure services or aiperf workers only start after model weights are fully downloaded and verified

### Non Goals

- Providing model registry functionality (models still sourced from HF/S3/NGC)

# Requirements

### Model Source Flexibility
DynamoModel MUST support multiple model sources including HuggingFace Hub, S3-compatible storage, NVIDIA NGC, and local file systems. The CR MUST use URI schemes (e.g., `hf://`, `s3://`, `ngc://`, `file://`) to specify sources.
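For illustration, the operator could dispatch on the URI scheme along these lines; this is a sketch, and the `SourceType` names and the `parseSource` helper are hypothetical, not part of the CRD contract:

```go
package controller

import (
	"fmt"
	"net/url"
)

// SourceType identifies where model weights are fetched from.
type SourceType string

const (
	SourceHuggingFace SourceType = "huggingface" // hf://org/repo
	SourceS3          SourceType = "s3"          // s3://bucket/prefix
	SourceNGC         SourceType = "ngc"         // ngc://org/model
	SourceFile        SourceType = "file"        // file:///path/on/node
)

// parseSource maps spec.source.uri to a source type and a repo/bucket path,
// e.g. hf://meta-llama/Llama-3.3-70B-Instruct ->
// (SourceHuggingFace, "meta-llama/Llama-3.3-70B-Instruct").
func parseSource(uri string) (SourceType, string, error) {
	u, err := url.Parse(uri)
	if err != nil {
		return "", "", fmt.Errorf("invalid source uri %q: %w", uri, err)
	}
	switch u.Scheme {
	case "hf":
		return SourceHuggingFace, u.Host + u.Path, nil
	case "s3":
		return SourceS3, u.Host + u.Path, nil
	case "ngc":
		return SourceNGC, u.Host + u.Path, nil
	case "file":
		return SourceFile, u.Path, nil
	default:
		return "", "", fmt.Errorf("unsupported scheme %q in %q", u.Scheme, uri)
	}
}
```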

### Version Immutability
Once a DynamoModel CR references a specific model version (e.g., HuggingFace commit SHA), that version MUST NOT change unless the CR is explicitly updated. This ensures deployment consistency.

### Status-Based Readiness
DynamoModel MUST expose a status field indicating readiness states (`Pending`, `Downloading`, `Ready`, `Failed`). Dependent resources (DGD, AIperf Job) SHOULD be able to wait for `Ready` state before proceeding.
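Dependent controllers can gate on this status with a small helper; a minimal sketch, assuming hypothetical Go types that mirror the status block shown later in this proposal and the standard Kubernetes condition helpers:

```go
package controller

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// DynamoModelStatus mirrors the proposed status block (hypothetical type).
type DynamoModelStatus struct {
	Phase      string             `json:"phase"` // Pending | Downloading | Ready | Failed
	Conditions []metav1.Condition `json:"conditions,omitempty"`
}

// isModelReady reports whether dependents (DGD, aiperf Job) may proceed.
func isModelReady(st DynamoModelStatus) bool {
	return st.Phase == "Ready" &&
		meta.IsStatusConditionTrue(st.Conditions, "Downloaded")
}
```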

### Storage Persistence
> **Comment:** Per my overall comment, let's rethink whether this is a true requirement.

Downloaded model weights MUST be stored in Persistent Volume Claims (PVCs) that persist beyond the lifecycle of individual DGDs, enabling reuse across multiple deployments.
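If the controller does own PVC creation (see the comment above), `storage.pvc.create: true` could be materialized roughly as follows; the helper name and flattened parameters are hypothetical:

```go
package controller

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// buildModelPVC builds the PVC backing a DynamoModel. The PVC deliberately
// outlives any single DGD so weights can be reused across deployments.
func buildModelPVC(crName, namespace, storageClass, size string) *corev1.PersistentVolumeClaim {
	sc := storageClass
	return &corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{
			Name:      crName + "-pvc", // default when storage.pvc.name is unset
			Namespace: namespace,
		},
		Spec: corev1.PersistentVolumeClaimSpec{
			AccessModes:      []corev1.PersistentVolumeAccessMode{corev1.ReadWriteMany},
			StorageClassName: &sc,
			// VolumeResourceRequirements on k8s.io/api >= v0.29
			// (ResourceRequirements on older versions).
			Resources: corev1.VolumeResourceRequirements{
				Requests: corev1.ResourceList{
					corev1.ResourceStorage: resource.MustParse(size),
				},
			},
		},
	}
}
```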

### Credential Management
> **Comment:** For downloading from S3, it will not be via a Secret but via IAM (similarly for other object stores), so we should be thoughtful about how this will work. It should not require someone to inject IAM credentials here; instead, credentials should be obtained via the normal means (e.g. IRSA in AWS).

DynamoModel MUST support Kubernetes Secret references for authenticated model sources (private HuggingFace repos, S3 buckets with credentials).
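To keep IAM-based sources working (see the comment above), credential injection should be conditional. A sketch with a hypothetical `SecretRef` type: `HF_TOKEN` is injected only when `spec.source.secretRef` is set, and S3-style sources otherwise rely on ambient identity such as IRSA:

```go
package controller

import corev1 "k8s.io/api/core/v1"

// SecretRef mirrors spec.source.secretRef (hypothetical type).
type SecretRef struct {
	Name string `json:"name"`
	Key  string `json:"key"`
}

// credentialEnv returns env vars for the download Job. When ref is nil,
// no secret is injected and the Job pod is expected to obtain credentials
// via its service account (e.g. IRSA on AWS).
func credentialEnv(ref *SecretRef) []corev1.EnvVar {
	if ref == nil {
		return nil
	}
	return []corev1.EnvVar{{
		Name: "HF_TOKEN",
		ValueFrom: &corev1.EnvVarSource{
			SecretKeyRef: &corev1.SecretKeySelector{
				LocalObjectReference: corev1.LocalObjectReference{Name: ref.Name},
				Key:                  ref.Key,
			},
		},
	}}
}
```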

# Proposal

## DynamoModel Custom Resource Definition

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
  name: llama-3-70b-instruct-v1
  namespace: dynamo-system
spec:
  # Model identification
  modelName: meta-llama/Llama-3.3-70B-Instruct
  version: 8a4556b53a7d81d7e07db15eafb5af5dcd321b33 # HuggingFace commit SHA
  # Source configuration
  source:
    uri: hf://meta-llama/Llama-3.3-70B-Instruct
    secretRef:
      name: huggingface-token
      key: token
  # Storage configuration
  storage:
    pvc:
      create: true # Auto-create PVC
      name: llama-3-70b-instruct-v1-pvc # Optional explicit name; defaults to <cr-name>-pvc
      storageClassName: fast-nvme # Simple field for convenience
      size: 150Gi # Simple field for convenience
      accessModes:
        - ReadWriteMany
      extraPvcSpec: {}
    # OR reference an existing PVC:
    # pvc:
    #   name: existing-model-cache
    #   subPath: llama-3-70b
  # Optional: download configuration (defaults to the HF downloader or the base Dynamo image with HF)
  downloader:
    image: my-registry/hf-downloader:my-tag # HF downloader
    resources: {}
    retryLimit: 5
    timeout: 3600s
```

> **Comment (on `storage`):** I think it's worth a discussion on whether we want the DynamoModel to be in charge of setting up the PVC or not.

> **Comment (on `storage`):** Another question: if a user deletes the DynamoModel CR, does the model in the PVC get deleted? What's the expected logic here?

> **Contributor (on `storage`):** Another: can multiple models be stored in a single PVC, or do we see this as a PVC per model?

> **Comment (on `downloader`):** We might want to consider a securityContext in this spec to make this Job work on OpenShift-style environments, so that it can run as a non-root user. Not a blocking issue.

> **Author:** Yes, good point. We can add extraPodSpec to allow users to set a security context, tolerations, etc.
Status is updated as follows after the model is downloaded:
```yaml
status:
  phase: Ready # Pending | Downloading | Ready | Failed
  conditions:
    - type: Downloaded
      status: "True"
      lastTransitionTime: "2025-10-07T10:30:00Z"
      reason: DownloadComplete
      message: "Model downloaded successfully"
  # Storage details
  storageRef:
    pvcName: llama-3-70b-instruct-v1-pvc
    path: /models/llama-3-70b-instruct-v1
  # Metadata
  modelSize: 140Gi
  downloadStartTime: "2025-10-07T10:00:00Z"
  downloadCompleteTime: "2025-10-07T10:30:00Z"
  lastAccessTime: "2025-10-07T12:15:00Z"
  # Usage tracking
  referencedBy:
    - kind: DynamoGraphDeployment
      name: vllm-disagg
      namespace: dynamo-system
```
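For illustration, the DynamoModel controller could record a successful download with the standard condition helpers; a sketch reusing the hypothetical `DynamoModelStatus` type from the readiness snippet above (`SetStatusCondition` fills in `lastTransitionTime`):

```go
package controller

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// markDownloaded flips the status once the download Job succeeds.
func markDownloaded(st *DynamoModelStatus) {
	st.Phase = "Ready"
	meta.SetStatusCondition(&st.Conditions, metav1.Condition{
		Type:    "Downloaded",
		Status:  metav1.ConditionTrue,
		Reason:  "DownloadComplete",
		Message: "Model downloaded successfully",
	})
}
```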
## DynamoGraphDeployment Integration
DGDs reference models using `modelRef`:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-disagg
  namespace: dynamo-system
spec:
  services:
    VllmPrefillWorker:
      modelRef:
        name: llama-3-70b-instruct-v1
        mountPath: /models # Where to mount in container
      replicas: 2
      image: my-registry/vllm:my-tag
    VllmDecodeWorker:
      modelRef:
        name: llama-3-70b-instruct-v1
        mountPath: /models
      replicas: 4
```

> **Comment (on `modelRef`):** I feel like this might be insufficient? For example, how will the launched worker know what the model key is to load and register in the MDC?

# Lifecycle

## DynamoModel Lifecycle
![DynamoModel Lifecycle](./dyn_model_lifecycle.png)
DynamoModel states and transitions:

- **Created** -> **Pending**: the DynamoModel resource is created and accepted by the controller
- **Pending** -> **Downloading**: the controller starts the model download job
- **Downloading** -> **Ready**: the model is downloaded successfully
- **Downloading** -> **Failed**: the model download fails
- **Failed** -> **Downloading**: a retry is triggered after a failure
- **Ready** -> **Deleted**: the DynamoModel resource is deleted
- **Deleted**: terminal state; no further transitions
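These transitions can be enforced as a small table-driven guard in the controller; a sketch using the phase names above, with `Created` and `Deleted` modeled as pseudo-phases around the status enum:

```go
package controller

// validTransitions encodes the state machine above; the controller rejects
// any phase change not listed here.
var validTransitions = map[string][]string{
	"Created":     {"Pending"},
	"Pending":     {"Downloading"},
	"Downloading": {"Ready", "Failed"},
	"Failed":      {"Downloading"}, // retry after failure
	"Ready":       {"Deleted"},     // Deleted is terminal
}

func canTransition(from, to string) bool {
	for _, next := range validTransitions[from] {
		if next == to {
			return true
		}
	}
	return false
}
```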

## DGD Lifecycle with Model Dependencies

![DGD Lifecycle with Model Dependencies](./dynamo-dgd-lifecycle.png)

### DGD Controller Changes

The existing DynamoGraphDeployment controller needs modifications:

1. **Model Reference Resolution**: When a service spec contains `modelRef`, resolve it to the actual DynamoModel CR
2. **Readiness Gating**: Before creating worker Deployments, check that the referenced model's `Ready` condition is `True` (see the sketch after this list)
3. **PVC Mounting**: Automatically mount the model's PVC into worker pods
4. **Environment Variables**: Set the `MODEL_PATH` environment variable to the model's mount path
5. **Reference Counting**: Update the model's `status.referencedBy` list as DGDs are created and deleted
6. **Watch Events**: Watch for DynamoModel status changes to trigger DGD reconciliation
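A minimal reconcile sketch of items 1, 2, and 6, assuming controller-runtime and hypothetical typed clients for the CRs; in line with the review comment on the lifecycle diagram, the DGD controller only gates and requeues, and never launches the download Job itself:

```go
package controller

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// DGDReconciler reconciles DynamoGraphDeployments (hypothetical; the real
// reconciler carries more state).
type DGDReconciler struct {
	client.Client
}

// ensureModelsReady resolves every service's modelRef and requeues the DGD
// until each referenced DynamoModel reports phase Ready.
func (r *DGDReconciler) ensureModelsReady(ctx context.Context, dgd *DynamoGraphDeployment) (ctrl.Result, bool, error) {
	for _, svc := range dgd.Spec.Services {
		if svc.ModelRef == nil {
			continue
		}
		var model DynamoModel
		key := client.ObjectKey{Namespace: dgd.Namespace, Name: svc.ModelRef.Name}
		if err := r.Get(ctx, key, &model); err != nil {
			return ctrl.Result{}, false, err
		}
		if model.Status.Phase != "Ready" {
			// Not ready yet: requeue. A watch on DynamoModel (item 6) also
			// triggers reconciliation as soon as the status flips.
			return ctrl.Result{RequeueAfter: 30 * time.Second}, false, nil
		}
	}
	return ctrl.Result{}, true, nil
}
```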



# Model Download Job Flow
The Dynamo operator will launch the download Job with the following env variables:
- `MODEL_PATH` set to the model's mount path
- `MODEL_NAME` set to the model's name
- `MODEL_VERSION` set to the model's version

#### DynamoModel CR
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
  name: llama-3-70b-instruct-v1
  namespace: dynamo-system
spec:
  # Model identification
  modelName: meta-llama/Llama-3.3-70B-Instruct
  version: 8a4556b53a7d81d7e07db15eafb5af5dcd321b33 # HuggingFace commit SHA
  # Source configuration
  source:
    uri: hf://meta-llama/Llama-3.3-70B-Instruct
    secretRef:
      name: huggingface-token
      key: token
```

## Model Express disabled
When Model Express is disabled, the operator launches the Job with a HuggingFace Hub client image, taking the token from the Secret named by `spec.source.secretRef` (name and key):
- `HF_TOKEN` set to the HuggingFace token


#### Model Download Job Spec
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: download-model-job
  namespace: <namespace>
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: download-model
          image: huggingface-downloader:latest
          env:
            - name: MODEL_PATH
              value: /models/llama-3-70b-instruct-v1
            - name: MODEL_NAME
              value: llama-3-70b-instruct-v1
            - name: MODEL_VERSION
              value: 8a4556b53a7d81d7e07db15eafb5af5dcd321b33
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: huggingface-token # from spec.source.secretRef
                  key: token
          volumeMounts:
            - name: model-pvc
              mountPath: /models
      volumes:
        - name: model-pvc
          persistentVolumeClaim:
            claimName: llama-3-70b-instruct-v1-pvc
```

## Model Express enabled
When Model Express is enabled, the operator will launch the Job with the Model Express image and one additional env variable:
- `MODEL_EXPRESS_URL` set to the Model Express server URL (cluster-internal URL)

#### Example: Model Download Job Spec

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: download-model-job
  namespace: <namespace>
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: download-model
          image: model-express-downloader:latest
          env:
            - name: MODEL_PATH
              value: /models/llama-3-70b-instruct-v1
            - name: MODEL_NAME
              value: llama-3-70b-instruct-v1
            - name: MODEL_VERSION
              value: 8a4556b53a7d81d7e07db15eafb5af5dcd321b33
            - name: MODEL_EXPRESS_URL
              value: http://model-express.<namespace>.svc.cluster.local
          volumeMounts:
            - name: model-pvc
              mountPath: /models
      volumes:
        - name: model-pvc
          persistentVolumeClaim:
            claimName: llama-3-70b-instruct-v1-pvc
```

# Benefits

- Eliminates boilerplate (PVC/Job init) by centralizing model operations in the operator
- Prevents model version drift with immutable version pinning
- Enables sharing across DGDs and aiperf jobs (single PVC, multiple mounts)
- Improves observability via status conditions
- Extensible to multiple sources (HF/S3/NGC/File) and future features (LoRA, air-gapped deployments from private model registries)


# Additional Features

- Model verification:
  - We can add verification of the entire model folder (files hashed in sorted path order)
  - Problem: HF doesn't provide folder checksums; these need to be pre-computed
  - The spec gains a `verification` block with `enabled` and `checksum` fields, as shown below
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
  name: llama-3-70b-instruct-v1
  namespace: dynamo-system
spec:
  # Additional verification
  verification:
    enabled: true
    checksum: sha256:abc123
status:
  conditions:
    - type: Verified
      status: "True"
      lastTransitionTime: "2025-10-07T10:30:00Z"
      reason: ChecksumValid
      message: "Model verification passed"
```
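A sketch of the pre-computed folder checksum described above: hash every file in sorted path order and fold the per-file digests into one value. The record format here is an assumption; whatever tool pre-computes `spec.verification.checksum` must use the identical scheme:

```go
package downloader

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"io/fs"
	"os"
	"path/filepath"
)

// folderChecksum hashes every regular file under root (WalkDir visits paths
// in lexical order) and returns one sha256 over "<relpath>\n<filehash>\n"
// records, suitable for comparing against spec.verification.checksum.
func folderChecksum(root string) (string, error) {
	agg := sha256.New()
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return err
		}
		f, err := os.Open(path)
		if err != nil {
			return err
		}
		defer f.Close()
		h := sha256.New()
		if _, err := io.Copy(h, f); err != nil {
			return err
		}
		rel, err := filepath.Rel(root, path)
		if err != nil {
			return err
		}
		fmt.Fprintf(agg, "%s\n%s\n", rel, hex.EncodeToString(h.Sum(nil)))
		return nil
	})
	if err != nil {
		return "", err
	}
	return "sha256:" + hex.EncodeToString(agg.Sum(nil)), nil
}
```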