Feat: shared annotation #2

Draft
cgetzen wants to merge 9 commits into main from feat/shared-annotation

@cgetzen cgetzen commented Feb 2, 2026

Summary

Problem

Slurm-bridge does not support colocating multiple pods on a single multi-GPU node, resulting in underutilization when workloads require fewer GPUs than the node provides.

Solution

This PR adds an optional workload annotation, slurmjob.slinky.slurm.net/shared, that accepts Slurm shared policy values (mcs, none, oversubscribe, topo, user). It is supported only on workloads that have a 1:1 relationship between Slurm jobs and pods, which excludes PodGroup and LeaderWorkerSet resources.

The admission controller ensures correctness:

  • validates the annotation value
  • ensures the annotation is immutable once the placeholder Slurm job is running
  • ensures the annotation is applied only to supported workloads (those with a 1:1 relationship between Slurm jobs and pods)

The scheduler then applies the "shared" setting when creating the Slurm job.
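
The three admission checks above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper name validateShared, its parameters, and the way workload kinds are represented are all assumptions. Only the annotation key, the accepted values, and the excluded kinds come from the description.

```go
package main

import "fmt"

// SharedAnnotation is the annotation key named in the PR description.
const SharedAnnotation = "slurmjob.slinky.slurm.net/shared"

// allowedShared holds the Slurm shared policy values listed in the description.
var allowedShared = map[string]bool{
	"mcs": true, "none": true, "oversubscribe": true, "topo": true, "user": true,
}

// groupKinds holds the workload kinds the description excludes.
// Representing kinds as plain strings is an assumption made for this sketch.
var groupKinds = map[string]bool{"PodGroup": true, "LeaderWorkerSet": true}

// validateShared is a hypothetical helper covering the three checks:
// immutability while the placeholder job runs, supported kind, valid value.
func validateShared(kind string, oldAnn, newAnn map[string]string, placeholderRunning bool) error {
	oldVal, oldSet := oldAnn[SharedAnnotation]
	newVal, newSet := newAnn[SharedAnnotation]

	// Immutable once the placeholder Slurm job is running.
	if placeholderRunning && (oldSet != newSet || oldVal != newVal) {
		return fmt.Errorf("%s is immutable once the placeholder job is running", SharedAnnotation)
	}
	// The annotation is optional; nothing more to check when it is absent.
	if !newSet {
		return nil
	}
	// Only workloads with a 1:1 job-to-pod relationship are supported.
	if groupKinds[kind] {
		return fmt.Errorf("%s is not supported on %s", SharedAnnotation, kind)
	}
	// The value must be one of the accepted shared policies.
	if !allowedShared[newVal] {
		return fmt.Errorf("invalid value %q for %s", newVal, SharedAnnotation)
	}
	return nil
}

func main() {
	user := map[string]string{SharedAnnotation: "user"}
	none := map[string]string{SharedAnnotation: "none"}
	bogus := map[string]string{SharedAnnotation: "bogus"}

	fmt.Println(validateShared("Pod", nil, user, false))      // accepted value on a supported kind
	fmt.Println(validateShared("Pod", nil, bogus, false))     // rejected: unknown value
	fmt.Println(validateShared("PodGroup", nil, user, false)) // rejected: group workload
	fmt.Println(validateShared("Pod", user, none, true))      // rejected: changed while running
}
```

Checking immutability first means a change is rejected while the placeholder job is running even if the new value would otherwise be valid.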

Limitations

Allowing group workloads to use the shared annotation is out of scope.

Group workloads use a single placeholder job for multiple pods with a fixed node count and one-node-per-pod assignment. Allowing shared on them would require supporting Slurm packing (fewer nodes than pods), which would require changes to PostFilter, submitJob node count, and annotatePodsWithNodes.
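
To make the node-count gap concrete, the sketch below compares the current one-node-per-pod assignment with a simple packed estimate. The function names and the GPU-only packing rule are assumptions for illustration; they are not how submitJob computes its node count, and real packing would also depend on CPU, memory, and Slurm's own placement.

```go
package main

import "fmt"

// nodesOnePerPod mirrors the group behavior described above: one node per pod.
func nodesOnePerPod(pods int) int {
	return pods
}

// nodesPacked is a hypothetical estimate of the nodes needed if pods could
// share a node, considering GPU counts only.
func nodesPacked(pods, gpusPerPod, gpusPerNode int) int {
	if pods <= 0 {
		return 0
	}
	if gpusPerPod <= 0 || gpusPerNode < gpusPerPod {
		// No GPU-based packing is possible; fall back to one node per pod.
		return pods
	}
	podsPerNode := gpusPerNode / gpusPerPod
	// Ceiling division: a partially filled node still counts as a node.
	return (pods + podsPerNode - 1) / podsPerNode
}

func main() {
	// Example: 4 pods, 2 GPUs each, on 8-GPU nodes.
	fmt.Println("one node per pod:", nodesOnePerPod(4))
	fmt.Println("packed estimate: ", nodesPacked(4, 2, 8))
}
```

With those example numbers, the fixed assignment requests four nodes while the packed estimate needs one, which is the gap that PostFilter, submitJob, and annotatePodsWithNodes would have to account for.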

Using group workloads with DRA poses additional challenges. Slurm-bridge currently assumes one pod per node per job: PreBind is called per-pod with (pod, nodeName), and GetResources(ctx, pod, nodeName) returns the job’s allocation on that node from Slurm’s NodeResourceLayout. One ResourceClaim is created per pod for that full allocation. With multiple pods on the same node, each pod should only receive a portion of the job's allocation.
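
The over-allocation problem can be illustrated with a small sketch. The claim type and claimsForFullAllocation function are hypothetical stand-ins for the per-pod ResourceClaim creation; they do not reflect the actual GetResources or PreBind signatures beyond what is described above.

```go
package main

import "fmt"

// claim is a hypothetical stand-in for a per-pod ResourceClaim.
type claim struct {
	pod  string
	gpus int
}

// claimsForFullAllocation models the current assumption: every pod on the
// node gets a claim for the job's entire allocation on that node.
func claimsForFullAllocation(pods []string, jobGPUsOnNode int) []claim {
	claims := make([]claim, 0, len(pods))
	for _, p := range pods {
		claims = append(claims, claim{pod: p, gpus: jobGPUsOnNode})
	}
	return claims
}

// totalGPUs sums the GPUs requested across all claims.
func totalGPUs(claims []claim) int {
	total := 0
	for _, c := range claims {
		total += c.gpus
	}
	return total
}

func main() {
	// Two pods of the same job land on one node where the job holds 4 GPUs.
	claims := claimsForFullAllocation([]string{"pod-a", "pod-b"}, 4)
	fmt.Println("GPUs allocated to the job on the node:", 4)
	fmt.Println("GPUs claimed across pods:", totalGPUs(claims))
}
```

With one pod per node the claimed total matches the allocation; with two pods on the same node it doubles, which is why each pod would need a claim for only its portion.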

Breaking Changes

All existing behavior is maintained by default. Only workloads that opt in to using slurmjob.slinky.slurm.net/shared are affected.

Testing Notes

Additional Context
