6 changes: 5 additions & 1 deletion README.md
@@ -76,7 +76,7 @@ Slurm is a full featured HPC workload manager. To highlight a few features:

## Limitations

- Exclusive, whole node allocations are made for each pod.
- Exclusive, whole node allocations are made for each pod when using group workloads (PodGroups, LeaderWorkerSet).

## Installation

@@ -97,6 +97,10 @@ helm install slurm-bridge oci://ghcr.io/slinkyproject/charts/slurm-bridge \

For additional instructions, see the [quickstart] guide.

## Configuration

For setting up Slurm and slurm-bridge for certain use-cases, see [Configuration](docs/configuration.md).

## Documentation

Project documentation is located in the [docs] directory of this repository.
40 changes: 40 additions & 0 deletions docs/configuration.md
@@ -0,0 +1,40 @@
# Configuration

## Pack Multiple Pods on a Node

Packing multiple pods onto a node requires changes in both Slurm and slurm-bridge.

By default, Slurm reserves a full node for each job. To enable packing, adjust the partition definition in `slurm.conf`:

```
# Set Oversubscribe to YES or FORCE
PartitionName=<name> ... OverSubscribe=YES
```

Optional tuning parameters:
```
# pack_serial_at_end: schedule serial jobs at the end of the backfill window
#   to reduce fragmentation and improve packing.
# bf_busy_nodes: prefer already-busy nodes when selecting resources, packing
#   jobs onto fewer nodes and leaving others idle for whole-node jobs. This
#   only applies to the backfill scheduler.

SchedulerParameters=pack_serial_at_end,bf_busy_nodes
```

When using slinky, this can be set by adjusting its `values.yaml`:

```yaml
controller:
extraConf: |
SchedulerParameters=pack_serial_at_end,bf_busy_nodes
nodesets:
slinky: # or other nodeset name
partition:
config: |
OverSubscribe=YES
```

By default, slurm-bridge schedules jobs with `shared: none`. To allow jobs to share nodes, set the Pod's `slurmjob.slinky.slurm.net/shared` annotation to `user`.
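
For example, a Pod that opts into node sharing might look like the following minimal sketch (the pod name, container, and image are placeholders, not part of the slurm-bridge documentation):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: shared-job            # placeholder name
  annotations:
    # Allow this pod's Slurm placeholder job to share nodes with
    # other jobs from the same user.
    slurmjob.slinky.slurm.net/shared: "user"
spec:
  containers:
    - name: main
      image: busybox:1.36     # placeholder image
      command: ["sleep", "30"]
```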

For more details, see:
- [cons_tres resource sharing](https://slurm.schedmd.com/cons_tres_share.html)
- [SchedulerParameters](https://slurm.schedmd.com/slurm.conf.html#OPT_SchedulerParameters)
- [Job settings](https://slurm.schedmd.com/sbatch.html)
1 change: 1 addition & 0 deletions docs/scheduler.md
@@ -100,6 +100,7 @@ see the [annotations.go] source.
| slurmjob.slinky.slurm.net/max-nodes | Sets the maximum number of nodes. | "3" |
| slurmjob.slinky.slurm.net/mem-per-node | Sets the amount of memory. | "8Gi" |
| slurmjob.slinky.slurm.net/partition | Overrides the default partition. | "debug" |
| slurmjob.slinky.slurm.net/shared | Sets the shared policy. | "user" |

An example of the annotations in use:

34 changes: 34 additions & 0 deletions internal/admission/admission.go
@@ -18,6 +18,8 @@ import (
	"sigs.k8s.io/controller-runtime/pkg/log"
	"sigs.k8s.io/controller-runtime/pkg/webhook"
	"sigs.k8s.io/controller-runtime/pkg/webhook/admission"
	lwsv1 "sigs.k8s.io/lws/api/leaderworkerset/v1"
	sched "sigs.k8s.io/scheduler-plugins/apis/scheduling/v1alpha1"
)

type PodAdmission struct {
@@ -86,6 +88,9 @@ func (r *PodAdmission) ValidateCreate(ctx context.Context, obj runtime.Object) (
	if pod.Spec.ResourceClaims != nil {
		return nil, fmt.Errorf("can't schedule a pod with a resourceclaim, use the annotation %s to request devices instead", wellknown.AnnotationGres)
	}
	if err := validateSharedAnnotation(pod); err != nil {
		return nil, err
	}
	return nil, nil
}

@@ -113,6 +118,16 @@ func (r *PodAdmission) ValidateUpdate(ctx context.Context, oldObj runtime.Object
			return nil, fmt.Errorf("can't update a running pod's placeholder node annotation")
		}
	}
	// Once the Slurm placeholder job is running, the shared annotation should not be modified.
	if newPod.Labels[wellknown.LabelPlaceholderJobId] != "" &&
		newPod.Annotations[wellknown.AnnotationPlaceholderNode] != "" {
		if oldPod.Annotations[wellknown.AnnotationShared] != newPod.Annotations[wellknown.AnnotationShared] {
			return nil, fmt.Errorf("can't change shared annotation when the Slurm placeholder job is already running")
		}
	}
	if err := validateSharedAnnotation(newPod); err != nil {
		return nil, err
	}
	return nil, nil
}

@@ -140,3 +155,22 @@ func (r *PodAdmission) isManagedNamespace(ctx context.Context, namespace string)
	}
	return slices.Contains(r.ManagedNamespaces, namespace), nil
}

// validateSharedAnnotation validates the shared annotation value and rejects
// group workloads (PodGroup, LeaderWorkerSet).
func validateSharedAnnotation(pod *corev1.Pod) error {
	value, ok := pod.Annotations[wellknown.AnnotationShared]
	if !ok {
		return nil
	}
	if err := wellknown.ValidateSharedValue(value); err != nil {
		return err
	}
	if pod.Labels[sched.PodGroupLabel] != "" {
		return fmt.Errorf("shared annotation is not allowed on PodGroup pods")
	}
	if pod.Labels[lwsv1.GroupUniqueHashLabelKey] != "" {
		return fmt.Errorf("shared annotation is not allowed on LeaderWorkerSet pods")
	}
	return nil
}
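
The implementation of `wellknown.ValidateSharedValue` is not part of this diff. A minimal standalone sketch of the kind of check it might perform, assuming the only accepted values are the `none` and `user` policies mentioned in the docs above (the real function may accept additional Slurm sharing modes):

```go
package main

import "fmt"

// validateSharedValue sketches the shape of wellknown.ValidateSharedValue.
// The accepted values ("none", "user") are assumptions drawn from the
// documentation in this PR, not the actual slurm-bridge implementation.
func validateSharedValue(value string) error {
	switch value {
	case "none", "user":
		return nil
	default:
		return fmt.Errorf("invalid shared annotation value %q: expected one of [none user]", value)
	}
}

func main() {
	fmt.Println(validateSharedValue("user"))  // nil: accepted
	fmt.Println(validateSharedValue("bogus")) // non-nil: rejected
}
```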