DynamoModel: Kubernetes Custom Resource to simplify Model Lifecycle Management #45
# DynamoModel: Kubernetes Custom Resource to simplify Model Lifecycle Management UX

**Status**: In-review

**Authors**: [biswapanda](https://github.com/biswapanda)

**Category**: Architecture

**Required Reviewers**: [Maksim, Itay, Anish, Ganesh, Neelay, Kavin]

**Review Date**: [targeted: Oct 9, 2025]

**Slack thread**: [link](https://nvidia.slack.com/archives/C06850J381Y/p1758647954211439?thread_ts=1758613631.539669&cid=C06850J381Y)

# Summary

This proposal introduces `DynamoModel`, a dedicated Kubernetes Custom Resource (CR) for managing model lifecycle in the Dynamo ecosystem. DynamoModel decouples model downloading, versioning, and caching from DynamoGraphDeployment (DGD), enabling consistent model references across deployments, benchmarks, and services while eliminating boilerplate code and preventing model version drift.

# Motivation

Currently, Dynamo users face three critical challenges:

1. **Model Version Drift**: Inconsistent behavior occurs when aiperf benchmarks use different model versions than deployments. This was observed during 70B model benchmarking, where the deployment used stale weights while the benchmark job pulled the latest commit from HuggingFace.

   Review comment: question: when we say the benchmark job used the latest - is this referencing the tokenizer for ai perf for example mismatched with the weights of the deployment? Not sure I follow what 'drift' means here - maybe 'mismatch' is a term - where different components can have different versions of the model?

   Reply: yes I meant unintentional version mismatch by pointing to There was tokenizer/config json changes while weights remained the same. In this case aiperf was an ephemeral job running on main ToT and deployment was an older snapshot of

2. **No Model Reuse Across Deployments and Perf Jobs**: Multiple DGDs or aiperf jobs cannot easily share the same model weights, which duplicates the operational overhead of managing PVCs, secrets, and Jobs.

   Review comment: question: does this mean that multiple deployments can't share a PVC or other shared storage for weights? would model express solve this?

   Reply: yes, there is a path to use model express. But overall k8s spec is a contract/interface and user can DynamoModel CRD is orthogonal and works with

3. **Boilerplate Code**: Each deployment requires *manual* setup of PVCs, secrets, and Jobs to download models before starting the DGD, adding complexity and maintenance burden.

These issues stem from tightly coupling model management with the deployment lifecycle, making it difficult to:
- Pin specific model versions across the ecosystem
- Share models between multiple deployments and benchmarks
- Verify that model weights are ready before starting workers (currently done manually by users)

## Goals

- Decouple model lifecycle from DynamoGraphDeployment lifecycle
- Enable model version pinning and eliminate version drift
- Provide model sharing across multiple DynamoGraphDeployments and aiperf jobs
- Simplify model download operations through operator-managed automation
- Ensure services or aiperf workers only start after model weights are fully downloaded and verified

### Non Goals

- Providing model registry functionality (models still sourced from HF/S3/NGC)

# Requirements

### Model Source Flexibility
DynamoModel MUST support multiple model sources including HuggingFace Hub, S3-compatible storage, NVIDIA NGC, and local file systems. The CR MUST use URI schemes (e.g., `hf://`, `s3://`, `ngc://`, `file://`) to specify sources.
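
As a rough illustration of the URI-scheme requirement, the sketch below splits a source URI into a scheme and a location and rejects unknown schemes. The names `ModelSource`, `SUPPORTED_SCHEMES`, and `parse_model_uri` are hypothetical and not part of the proposal; only the four schemes come from the text above.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Schemes named in the requirement; everything else here is illustrative.
SUPPORTED_SCHEMES = {"hf", "s3", "ngc", "file"}


@dataclass(frozen=True)
class ModelSource:
    scheme: str    # e.g. "hf"
    location: str  # everything after "<scheme>://"


def parse_model_uri(uri: str) -> ModelSource:
    """Split a model source URI into scheme and location; reject unknown schemes."""
    parsed = urlparse(uri)
    if parsed.scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported model source scheme: {parsed.scheme!r}")
    # netloc + path keeps "org/repo" together for hf:// and the
    # absolute path for file:/// (where netloc is empty).
    location = parsed.netloc + parsed.path
    if not location:
        raise ValueError(f"model source URI has no location: {uri!r}")
    return ModelSource(parsed.scheme, location)
```

For example, `parse_model_uri("hf://meta-llama/Llama-3.3-70B-Instruct")` yields scheme `hf` and location `meta-llama/Llama-3.3-70B-Instruct`, which a downloader could then dispatch on.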

### Version Immutability
Once a DynamoModel CR references a specific model version (e.g., HuggingFace commit SHA), that version MUST NOT change unless the CR is explicitly updated. This ensures deployment consistency.
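
One way to enforce pinning for HuggingFace sources is to reject any `version` that is not a full commit SHA, so a moving reference such as a branch name cannot slip in. This is a minimal sketch under that assumption; the proposal does not specify validation rules, and `is_pinned_version` and `check_spec` are hypothetical names.

```python
import re

# A git commit SHA-1 in full form is 40 lowercase hex characters.
_COMMIT_SHA = re.compile(r"[0-9a-f]{40}")


def is_pinned_version(version: str) -> bool:
    """Return True only for a full 40-character hex commit SHA."""
    return _COMMIT_SHA.fullmatch(version) is not None


def check_spec(spec: dict) -> None:
    """Raise if spec.version is a moving reference (branch, tag, short SHA)."""
    version = spec.get("version", "")
    if not is_pinned_version(version):
        raise ValueError(f"version {version!r} is not a pinned commit SHA")
```

A check like this could run in an admission webhook or at the start of reconciliation; which of the two is appropriate is left open here.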

### Status-Based Readiness
DynamoModel MUST expose a status field indicating readiness states (`Pending`, `Downloading`, `Ready`, `Failed`). Dependent resources (DGD, aiperf Job) SHOULD be able to wait for the `Ready` state before proceeding.

### Storage Persistence
Downloaded model weights MUST be stored in Persistent Volume Claims (PVCs) that persist beyond the lifecycle of individual DGDs, enabling reuse across multiple deployments.

Review comment: Per my overall comment, let's re-think whether this is a true requirement.

### Credential Management
DynamoModel MUST support Kubernetes Secret references for authenticated model sources (private HuggingFace repos, S3 buckets with credentials).

Review comment: For downloading from S3, it will not be via a Secret but via IAM (similar for other object stores), so we should be thoughtful on how this will work. It should not require someone to inject IAM credentials here, and instead should be getting it via normal means (e.g. IRSA in AWS)

# Proposal

## DynamoModel Custom Resource Definition

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
  name: llama-3-70b-instruct-v1
  namespace: dynamo-system
spec:
  # Model identification
  modelName: meta-llama/Llama-3.3-70B-Instruct
  version: 8a4556b53a7d81d7e07db15eafb5af5dcd321b33  # HuggingFace commit SHA
  # Source configuration
  source:
    uri: hf://meta-llama/Llama-3.3-70B-Instruct
    secretRef:
      name: huggingface-token
      key: token
  # Storage configuration
  storage:
    pvc:
      create: true  # Auto-create PVC
      name: llama-3-70b-instruct-v1-pvc  # Optional explicit name override; defaults to <cr-name>-pvc
      storageClassName: fast-nvme  # Simple field for convenience
      size: 150Gi  # Simple field for convenience
      accessModes:
        - ReadWriteMany
      extraPvcSpec: {}
    # OR reference existing PVC
    # pvc:
    #   name: existing-model-cache
    #   subPath: llama-3-70b

  # Optional: Download configuration (defaults to HF Downloader or Base Dynamo image with HF)
  downloader:
    image: my-registry/hf-downloader:my-tag  # HF Downloader
    resources: {}
    retryLimit: 5
    timeout: 3600s
```

Review comment (on `storage`): I think it's worth a discussion on whether we want the DynamoModel to be in charge of setting up the PVC or not.

Review comment (on `storage`): Another q: If user deletes the DynamoModel CR, does model in PVC get deleted? what's the expected logic here?

Review comment (on `storage`): another: can multiple models be stored in a single PVC - or do we see this as a pvc per model?

Review comment (on `downloader`): we might want to consider securityContext in this spec to make this Job work on Openshift-style envs, that way you can run as non-root user. not a blocking issue

Reply: yes good point - we can add
The status is updated as follows after the model is downloaded:
```yaml
status:
  phase: Ready  # Pending | Downloading | Ready | Failed
  conditions:
    - type: Downloaded
      status: "True"
      lastTransitionTime: "2025-10-07T10:30:00Z"
      reason: DownloadComplete
      message: "Model downloaded successfully"
  # Storage details
  storageRef:
    pvcName: llama-3-70b-instruct-v1-pvc
    path: /models/llama-3-70b-instruct-v1
  # Metadata
  modelSize: 140Gi
  downloadStartTime: "2025-10-07T10:00:00Z"
  downloadCompleteTime: "2025-10-07T10:30:00Z"
  lastAccessTime: "2025-10-07T12:15:00Z"
  # Usage tracking
  referencedBy:
    - kind: DynamoGraphDeployment
      name: vllm-disagg
      namespace: dynamo-system
```
## DynamoGraphDeployment Integration
DGDs reference models using `modelRef`:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-disagg
  namespace: dynamo-system
spec:
  services:
    VllmPrefillWorker:
      modelRef:
        name: llama-3-70b-instruct-v1
        mountPath: /models  # Where to mount in container
      replicas: 2
      image: my-registry/vllm:my-tag
    VllmDecodeWorker:
      modelRef:
        name: llama-3-70b-instruct-v1
        mountPath: /models
      replicas: 4
```

Review comment (on `modelRef.mountPath`): I feel like this might be insufficient? For example, how will the launched worker know what the model key is to load and register in the MDC?

# Lifecycle

## DynamoModel Lifecycle

![DynamoModel Lifecycle](./dynamo_model_lifecycle.png)

DynamoModel States and Transitions

State: Created
- Transition: Created -> Pending
  When the DynamoModel resource is created and accepted by the controller

State: Pending
- Transition: Pending -> Downloading
  When the controller starts the model download job

State: Downloading
- Transition: Downloading -> Ready
  When the model is downloaded successfully
- Transition: Downloading -> Failed
  When the model download fails

State: Failed
- Transition: Failed -> Downloading
  When a retry is triggered after a failure

State: Ready
- Transition: Ready -> Deleted
  When the DynamoModel resource is deleted

State: Deleted
- Terminal state; no further transitions
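
The transitions listed above can be written down as a small lookup table. The sketch below is illustrative only: `Phase`, `ALLOWED`, and `transition` are hypothetical names, and the table contains exactly the transitions listed in this section and nothing more.

```python
from enum import Enum


class Phase(str, Enum):
    CREATED = "Created"
    PENDING = "Pending"
    DOWNLOADING = "Downloading"
    READY = "Ready"
    FAILED = "Failed"
    DELETED = "Deleted"


# Allowed transitions, copied from the state list above.
ALLOWED = {
    Phase.CREATED: {Phase.PENDING},
    Phase.PENDING: {Phase.DOWNLOADING},
    Phase.DOWNLOADING: {Phase.READY, Phase.FAILED},
    Phase.FAILED: {Phase.DOWNLOADING},
    Phase.READY: {Phase.DELETED},
    Phase.DELETED: set(),  # terminal
}


def transition(current: Phase, target: Phase) -> Phase:
    """Return the target phase if the move is allowed, otherwise raise."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Keeping the table as data makes gaps easy to spot in review; for instance, the list above does not define how a `Failed` model is deleted.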

## DGD Lifecycle with Model Dependencies

![DGD Lifecycle with Model Dependencies](./dgd_lifecycle.png)

### DGD Controller Changes

The existing DynamoGraphDeployment controller needs modifications:

1. **Model Reference Resolution**: When a service spec contains `modelRef`, resolve it to the actual DynamoModel CR
2. **Readiness Gating**: Before creating worker Deployments, check that the referenced model's `Ready` condition is `True`
3. **PVC Mounting**: Automatically mount the model's PVC to worker pods
4. **Environment Variables**: Set `MODEL_PATH` environment variable to the model's mount path
5. **Reference Counting**: Increment/decrement the model's `referenceCount` when DGDs are created/deleted
6. **Watch Events**: Watch for DynamoModel status changes to trigger DGD reconciliation
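
Steps 1 to 4 above could be sketched as a pure function over plain dictionaries. This is not the controller implementation: `plan_model_mount` is a hypothetical helper, and it assumes readiness is read from `status.phase` and the PVC name from `status.storageRef.pvcName`, as in the status example earlier in this proposal.

```python
from typing import Optional


def plan_model_mount(service_spec: dict, models: dict) -> Optional[dict]:
    """Resolve a service's modelRef; return None when the caller should requeue.

    `models` maps DynamoModel names to their CR contents (assumed shape).
    """
    ref = service_spec["modelRef"]
    model = models.get(ref["name"])  # step 1: reference resolution
    if model is None or model.get("status", {}).get("phase") != "Ready":
        return None  # step 2: readiness gating, reconcile again later
    storage = model["status"]["storageRef"]
    return {
        "claimName": storage["pvcName"],           # step 3: PVC to mount
        "mountPath": ref["mountPath"],
        "env": {"MODEL_PATH": ref["mountPath"]},   # step 4: env for the worker
    }
```

Returning `None` instead of raising keeps a not-yet-ready model on the normal requeue path rather than the error path.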

# Model download job flow
The Dynamo Operator will launch a Job with the following environment variables:
- `MODEL_PATH` set to the model's mount path
- `MODEL_NAME` set to the model's name
- `MODEL_VERSION` set to the model's version
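
The three variables above could be derived from the CR roughly as follows. This is a sketch, not the operator's code: `build_download_env` and the `mount_root` default are assumptions, and `MODEL_NAME` is taken from `metadata.name` only because the Job examples in this proposal use the CR name there.

```python
from typing import Dict, List


def build_download_env(model: dict, mount_root: str = "/models") -> List[Dict[str, str]]:
    """Build the env list for the download Job container from a DynamoModel CR."""
    name = model["metadata"]["name"]
    return [
        # Path under the mounted PVC where the weights are written.
        {"name": "MODEL_PATH", "value": f"{mount_root}/{name}"},
        {"name": "MODEL_NAME", "value": name},
        {"name": "MODEL_VERSION", "value": model["spec"]["version"]},
    ]
```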

#### Dynamo model CR
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
  name: llama-3-70b-instruct-v1
  namespace: dynamo-system
spec:
  # Model identification
  modelName: meta-llama/Llama-3.3-70B-Instruct
  version: 8a4556b53a7d81d7e07db15eafb5af5dcd321b33  # HuggingFace commit SHA
  # Source configuration
  source:
    uri: hf://meta-llama/Llama-3.3-70B-Instruct
    secretRef:
      name: huggingface-token
      key: token
```

## Model Express disabled
The operator launches a Job using a HuggingFace Hub client image and injects the HuggingFace token from the Secret referenced by `spec.source.secretRef` (`name` and `key`).
- `HF_TOKEN` set to the HuggingFace token

#### Model Download Job Spec
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: download-model-job
  namespace: <namespace>
spec:
  template:
    spec:
      # Job pod templates must set restartPolicy to Never or OnFailure
      restartPolicy: OnFailure
      containers:
        - name: download-model
          image: huggingface-downloader:latest
          env:
            - name: MODEL_PATH
              value: /models/llama-3-70b-instruct-v1
            - name: MODEL_NAME
              value: llama-3-70b-instruct-v1
            - name: MODEL_VERSION
              value: 8a4556b53a7d81d7e07db15eafb5af5dcd321b33
            - name: HF_TOKEN
              # Read from the Secret named in spec.source.secretRef
              valueFrom:
                secretKeyRef:
                  name: huggingface-token
                  key: token
          volumeMounts:
            - name: model-pvc
              mountPath: /models
      volumes:
        - name: model-pvc
          persistentVolumeClaim:
            claimName: llama-3-70b-instruct-v1-pvc
```

## Model Express enabled
When Model Express is enabled, the operator launches a Job using the Model Express image with one additional environment variable:
- `MODEL_EXPRESS_URL` set to the Model Express server URL (cluster-internal URL)

### Example: Model Download Job Spec

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: download-model-job
  namespace: <namespace>
spec:
  template:
    spec:
      # Job pod templates must set restartPolicy to Never or OnFailure
      restartPolicy: OnFailure
      containers:
        - name: download-model
          image: model-express-downloader:latest
          env:
            - name: MODEL_PATH
              value: /models/llama-3-70b-instruct-v1
            - name: MODEL_NAME
              value: llama-3-70b-instruct-v1
            - name: MODEL_VERSION
              value: 8a4556b53a7d81d7e07db15eafb5af5dcd321b33
            - name: MODEL_EXPRESS_URL
              value: http://model-express.<namespace>.svc.cluster.local
          volumeMounts:
            - name: model-pvc
              mountPath: /models
      volumes:
        - name: model-pvc
          persistentVolumeClaim:
            claimName: llama-3-70b-instruct-v1-pvc
```

# Benefits

- Eliminates boilerplate (PVC/Job init) by centralizing model operations in the operator
- Prevents model version drift with immutable version pinning
- Enables sharing across DGDs and aiperf jobs (single PVC, multiple mounts)
- Improves observability via status conditions
- Extensible to multiple sources (HF/S3/NGC/File) and future features (LoRA, air-gapped deployments from private model registries)

# Additional Features

- Model verification:
  - We can add verification of the entire folder (sorted by file path)
  - Problem: HF doesn't provide folder checksums; these need to be pre-computed
  - verification:
    - enabled: true
    - checksum: sha256:abc123

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
  name: llama-3-70b-instruct-v1
  namespace: dynamo-system
spec:
  # Additional verification
  verification:
    enabled: true
    checksum: sha256:abc123
status:
  conditions:
    - type: Verified
      status: "True"
      lastTransitionTime: "2025-10-07T10:30:00Z"
      reason: ChecksumValid
      message: "Model verification passed"
```
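
A folder checksum "sorted by file path" could be pre-computed along the lines below. The exact scheme is an assumption: this sketch feeds each file's relative path, a NUL separator, and the file bytes into a single SHA-256, and formats the result like the `sha256:` value in the example above. `folder_sha256` and `verify_folder` are hypothetical names.

```python
import hashlib
from pathlib import Path


def folder_sha256(root: str) -> str:
    """Hash every file under root in sorted relative-path order."""
    base = Path(root)
    digest = hashlib.sha256()
    rel_paths = sorted(
        p.relative_to(base).as_posix() for p in base.rglob("*") if p.is_file()
    )
    for rel in rel_paths:
        # Include the path so renames change the checksum, not only content edits.
        digest.update(rel.encode("utf-8"))
        digest.update(b"\0")
        with (base / rel).open("rb") as fh:
            # Read in 1 MiB chunks so large weight files are not loaded whole.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
    return "sha256:" + digest.hexdigest()


def verify_folder(root: str, expected: str) -> bool:
    """Compare the computed checksum with spec.verification.checksum."""
    return folder_sha256(root) == expected
```

Sorting by relative path makes the result independent of filesystem listing order, which is what lets the same checksum be computed once and reused across clusters.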

Review comment: the DGD controller shouldn't launch the model job. The new DynamoModel should be in charge of reconciling all DynamoModel and launching jobs for models. DGD reconciliation loop should requeue until DynamoModel is ready.

Reply: agreed, I'll fix the diagram.