Add ComputeDomain for running multi-node workloads #225
Merged
72 commits
cdc471e Fix tooling versions (elezar)
d5dba3a Remove +clientgen from GpuConfig, MigConfig, and ImexChannelConfig (klueska)
1bde3dc Add CRD for creating a multi-node environment (klueska)
d04de6a Update Makefile to generate MultiNodeEnvironment CRD, client, deepcopy (klueska)
267acfc Add generated MultiNodeEnvironment CRD, client, and deepcopy (klueska)
04822e9 Make the nvidia.com client set available to the driver (klueska)
dbe6212 Add indirection of ImexManager through wrapping Controller abstraction (klueska)
9fcf823 Add a workqueue abstraction for processing objects pulled from informers (klueska)
9567a97 Add logic to autogenerate a ResourceClaim from a MultiNodeEnvironment (klueska)
c592906 Add logic to autogenerate a DeviceClass from a MultiNodeEnvironment (klueska)
3e35d7e Allow either a resourceClaimName or a deviceClassName to be specified (klueska)
4e19938 Add Deployment support for MultiNodeEnvironments and completely refactor (klueska)
252ef56 Rename and copy cmds / helm charts to split GPU and IMEX drivers (klueska)
1599501 Strip GPU / IMEX drivers to remove corresponding devices (klueska)
e156811 Add ability to allocate a per-node IMEX deamon via a ResourceClaim (klueska)
def999d Rename MultiNodeEnvironment to ComputeDomain (klueska)
a3ff033 Rename gpu.nvidia.com/v1alpha1 API to resource.nvidia.com/v1beta1 (klueska)
899a701 Move to GetComputeDomainFunc instead of ComputeDomainExistsFunc (klueska)
c7f9061 Add ComputeDomainStatus and set it as its deployment pods come online (klueska)
62820f9 Update ImexDaemonSettingsManager to pull IPs from ComputeDomain status (klueska)
af8d432 Add the ability to set affinities as part of a ComputeDomain (klueska)
619f4d9 Move creation of IMEX channel pool to after the deployment is fully up (klueska)
ab0c8e9 Add placeholder for Delayed vs. Immediate mode for ComputeDomain (klueska)
f8b3265 Support ResourceClaimNames as a list in a ComputeDomain (klueska)
a9c2e90 Update vendored nvidia-container-toolkit to main (klueska)
38a5ab8 Remove explicit mounting of nvidia-imex and nvidia-imex-ctl (klueska)
b18e3ec Add a finalizer to ComputeDomains to ensure they are the last removed (klueska)
301086b Add optimization to avoid redundant Delete calls (klueska)
9d57625 Standardize on passing ComputeDomainUID to RemoveFinalizer calls (klueska)
8edb57b Remove unnecessary code to check for ComputeDomain existence (klueska)
b78aeae Pull RemoveFinalizer() out of Delete() and call it conditionally (klueska)
66e0680 Rename ImexDaemonSettingsManager and move CDI edits for Channeln there (klueska)
701b674 Skip injection of the IMEX channel device node if no cliqueID (klueska)
211bd3f Add a status field for the ComputeDomain.Status (klueska)
f51c43c Move channel pools out of deployment and into computedomain manager (klueska)
6a7adc8 Update validating admission policy to support allNodes in resource slice (klueska)
cbe7c4c Add support for Delayed mode in ComputeDomains (klueska)
16671b6 Use workqueue to retry failed operations in plugin instead of retry loop (klueska)
4e59eef Rename k8s-dra-driver to k8s-dra-driver-gpu (klueska)
0bbe353 Rename all helm references to 'fullname' to just 'name' (klueska)
820fc47 Rename helm chart name from k8s-dra-driver-gpu to nvidia-dra-driver-gpu (klueska)
dcfb94d Remove helm chart for GPU DRA driver in favor of one for ComputeDomains (klueska)
ee553a1 Rename containers and commands inside them in consolidated helm chart (klueska)
f3afd1a Remove all (appropriate) references to the term IMEX (klueska)
2956dba Change to calendar versioning instead of semver (klueska)
5e58318 Add condition to helm chart to enable computeDomains and GPUs separately (klueska)
4b7441b Advertise channel 0 from each node as a "default" channel to consume (klueska)
25b59ec Remove the ability to define a custom deviceclass for a computedomain (klueska)
a390f15 Make ComputeDomain specs immutable (klueska)
baffbb1 Add a struct around the list of resource claims and specify names there (klueska)
381ee05 Limit Deployment's ResourceClaimTemplate to driver namespace (klueska)
d052490 Make the ResourceClaimTemplateManager specific to targeting daemons (klueska)
b082ab0 Move to ResourceClaimTemplates instead of global ResourceClaims (klueska)
45e0177 Remove ImmediateMode and make Delayed the only option (klueska)
75edd9b Add waiting for dependent objects of ComputeDomain to be fully removed (klueska)
7821db4 Use a daemonset instead of a deployment to run ComputeDomain daemons (klueska)
e9c7101 Block ComputeDomain deletion while a workload is still running in it (klueska)
9cbc576 Add a liveness probe to the ComputeDomain daemon (klueska)
3373134 Ensure that ResourceClaim / ComputeDomain namespace are the same (klueska)
993b853 Add the notion of a "permanent" error to the kubelet plugin (klueska)
7783596 Harden logic around calling prepare / unprepare on allocated claims (klueska)
0e5611f Abstract out getConfigResultsMap so it can be reused later (klueska)
ef25561 Unconditionally unprepare imex channels and daemons (klueska)
a37de39 Treat a ClusterUUID of all 0s to mean no IMEX support as well (klueska)
4625953 Add a level of indiraction with a new 'channel' field in ComputeDomain (klueska)
718e69d Ensure that the fabric-imex-mgmt nvcap is created and injected always (klueska)
5a83bac Recursively unmount /proc/driver/nvidia if it is mounted (klueska)
3ea7913 Add demo specs for working with compute domains (klueska)
578ab87 Only inject channel / daemon settings if running on an IMEX capable node (klueska)
4464442 Add periodic cleanup of stale objects owned by deleted ComputeDomains (klueska)
222df11 Allow the DRA driver for GPUs to be force installed if desired (klueska)
474f968 Determine cliqueID from NVML not node label (klueska)
@@ -1,7 +1,8 @@
 .cache/
 .bash_history
-/nvidia-dra-controller
-/nvidia-dra-plugin
+/compute-domain-controller
+/compute-domain-kubelet-plugin
+/gpu-kubelet-plugin
 .idea
 [._]*.sw[a-p]
 coverage.out
@@ -0,0 +1,86 @@
+/*
+ * Copyright (c) 2024, NVIDIA CORPORATION.  All rights reserved.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package v1beta1
+
+import (
+	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+)
+
+const (
+	ComputeDomainStatusReady    = "Ready"
+	ComputeDomainStatusNotReady = "NotReady"
+)
+
+// +genclient
+// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object
+// +k8s:openapi-gen=true
+// +kubebuilder:resource:scope=Namespaced
+// +kubebuilder:subresource:status
+
+// ComputeDomain prepares a set of nodes to run a multi-node workload in.
+type ComputeDomain struct {
+	metav1.TypeMeta   `json:",inline"`
+	metav1.ObjectMeta `json:"metadata,omitempty"`
+
+	Spec   ComputeDomainSpec   `json:"spec,omitempty"`
+	Status ComputeDomainStatus `json:"status,omitempty"`
+}
+
+// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object
+
+// ComputeDomainList provides a list of ComputeDomains.
+type ComputeDomainList struct {
+	metav1.TypeMeta `json:",inline"`
+	metav1.ListMeta `json:"metadata,omitempty"`
+
+	Items []ComputeDomain `json:"items"`
+}
+
+// +kubebuilder:validation:XValidation:rule="self == oldSelf", message="A computeDomain.spec is immutable"
+
+// ComputeDomainSpec provides the spec for a ComputeDomain.
+type ComputeDomainSpec struct {
+	NumNodes int                       `json:"numNodes"`
+	Channel  *ComputeDomainChannelSpec `json:"channel"`
+}
+
+// ComputeDomainChannelSpec provides the spec for a channel used to run a workload inside a ComputeDomain.
+type ComputeDomainChannelSpec struct {
+	ResourceClaimTemplate ComputeDomainResourceClaimTemplate `json:"resourceClaimTemplate"`
+}
+
+// ComputeDomainResourceClaimTemplate provides the details of the ResourceClaimTemplate to generate.
+type ComputeDomainResourceClaimTemplate struct {
+	Name string `json:"name"`
+}
+
+// ComputeDomainStatus provides the status for a ComputeDomain.
+type ComputeDomainStatus struct {
+	// +kubebuilder:validation:Enum=Ready;NotReady
+	// +kubebuilder:default=NotReady
+	Status string `json:"status"`
+	// +listType=map
+	// +listMapKey=name
+	Nodes []*ComputeDomainNode `json:"nodes,omitempty"`
+}
+
+// ComputeDomainNode provides information about each node added to a ComputeDomain.
+type ComputeDomainNode struct {
+	Name      string `json:"name"`
+	IPAddress string `json:"ipAddress"`
+	CliqueID  string `json:"cliqueID"`
+}
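To illustrate how the `Status` and `Nodes` fields above might relate, here is a minimal sketch. The readiness rule is an assumption, not taken from the PR: it treats a domain as `Ready` once at least `numNodes` nodes have registered. The helper name `deriveStatus`, the trimmed local type, and the sample node values are all hypothetical.

```go
package main

import "fmt"

// Status values copied from the diff.
const (
	ComputeDomainStatusReady    = "Ready"
	ComputeDomainStatusNotReady = "NotReady"
)

// ComputeDomainNode is a local mirror of the API type, without JSON tags.
type ComputeDomainNode struct {
	Name      string
	IPAddress string
	CliqueID  string
}

// deriveStatus is a hypothetical helper (not part of the PR). It assumes a
// domain is Ready once at least numNodes nodes have registered; a
// non-positive numNodes is always reported as NotReady.
func deriveStatus(numNodes int, nodes []*ComputeDomainNode) string {
	if numNodes > 0 && len(nodes) >= numNodes {
		return ComputeDomainStatusReady
	}
	return ComputeDomainStatusNotReady
}

func main() {
	// Sample values are made up for illustration.
	nodes := []*ComputeDomainNode{
		{Name: "node-a", IPAddress: "10.0.0.1", CliqueID: "clique-0"},
	}
	fmt.Println(deriveStatus(2, nodes)) // one of two nodes registered

	nodes = append(nodes, &ComputeDomainNode{
		Name: "node-b", IPAddress: "10.0.0.2", CliqueID: "clique-0",
	})
	fmt.Println(deriveStatus(2, nodes)) // both nodes registered
}
```

The actual controller may use a different condition (for example, pod readiness rather than node count); the sketch only shows the shape of the data involved.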
Should `channel` be optional, with non-IMEX use cases in mind for the future? I know we are currently focused solely on IMEX support, but if we want to carry the ComputeDomain concept forward, we may encounter clusters without IMEX (channels).