Summary
Problem
Slurm-bridge does not support colocating multiple pods on a single multi-GPU node, resulting in underutilization when workloads require fewer GPUs than the node provides.
Solution
This adds an optional workload annotation `slurmjob.slinky.slurm.net/shared`, accepting the Slurm shared-policy values (`mcs`, `none`, `oversubscribe`, `topo`, `user`), on workloads that have a 1:1 relationship between Slurm jobs and pods. This excludes PodGroup and LeaderWorkerSet resources. The admission controller validates the annotation to ensure correctness.
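A minimal sketch of the validation the admission controller might perform. The function and its signature are hypothetical; only the annotation key, the accepted value set, and the exclusion of group workloads come from this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// Annotation key and the Slurm shared-policy values it accepts,
// as described in this PR.
const sharedAnnotation = "slurmjob.slinky.slurm.net/shared"

var validSharedValues = map[string]bool{
	"mcs": true, "none": true, "oversubscribe": true, "topo": true, "user": true,
}

// validateShared is a hypothetical admission check: it accepts an absent
// annotation, rejects unknown values, and rejects the annotation on group
// workloads, which lack a 1:1 job-to-pod relationship.
func validateShared(annotations map[string]string, isGroupWorkload bool) error {
	val, ok := annotations[sharedAnnotation]
	if !ok {
		return nil // annotation is optional; default behavior is unchanged
	}
	if isGroupWorkload {
		return fmt.Errorf("%s is not supported on group workloads (PodGroup, LeaderWorkerSet)", sharedAnnotation)
	}
	if !validSharedValues[strings.ToLower(val)] {
		return fmt.Errorf("invalid %s value %q; must be one of mcs, none, oversubscribe, topo, user", sharedAnnotation, val)
	}
	return nil
}
```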
The scheduler then applies the `shared` setting when creating the Slurm job.
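How the scheduler might propagate the annotation into the job it submits can be sketched as follows; `JobOptions` and its `Shared` field are hypothetical stand-ins for the real submission payload, not the slurm-bridge API.

```go
package main

// JobOptions is a hypothetical stand-in for the Slurm job submission
// payload the scheduler builds.
type JobOptions struct {
	Shared string // e.g. "oversubscribe"; empty means Slurm's default
}

// buildJobOptions copies the shared annotation, if present, onto the job.
// The admission controller has already validated the value.
func buildJobOptions(annotations map[string]string) JobOptions {
	opts := JobOptions{}
	if v, ok := annotations["slurmjob.slinky.slurm.net/shared"]; ok {
		opts.Shared = v
	}
	return opts
}
```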
Limitations
Allowing group workloads to use the shared annotation is out of scope.
Group workloads use a single placeholder job for multiple pods, with a fixed node count and a one-node-per-pod assignment. Allowing `shared` on them would require supporting Slurm packing (fewer nodes than pods), which would require changes to `PostFilter`, the `submitJob` node count, and `annotatePodsWithNodes`.
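To illustrate the packing gap, the node-count change can be sketched as a hypothetical computation (function names are illustrative, not from the slurm-bridge codebase):

```go
package main

// nodeCountCurrent reflects today's behavior: the placeholder job
// requests one node per pod.
func nodeCountCurrent(pods int) int {
	return pods
}

// nodeCountPacked is the hypothetical packed variant that shared group
// workloads would need: fewer nodes than pods when several pods fit on
// one node (ceiling division).
func nodeCountPacked(pods, podsPerNode int) int {
	return (pods + podsPerNode - 1) / podsPerNode
}
```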
Using group workloads with DRA poses additional challenges. Slurm-bridge currently assumes one pod per node per job: `PreBind` is called per pod with `(pod, nodeName)`, and `GetResources(ctx, pod, nodeName)` returns the job's allocation on that node from Slurm's `NodeResourceLayout`. One ResourceClaim is created per pod for that full allocation. With multiple pods on the same node, each pod should only receive a portion of the job's allocation.

Breaking Changes
All existing behavior is maintained by default. Only workloads that opt in via `slurmjob.slinky.slurm.net/shared` are affected.

Testing Notes
Additional Context