Update README and licence #33

Draft · wants to merge 7 commits into base: main
1 change: 1 addition & 0 deletions LICENSE
@@ -1,6 +1,7 @@
MIT License

Copyright (c) 2019 Giovanni Torres
Copyright (c) 2023 StackHPC

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
37 changes: 19 additions & 18 deletions README.md
@@ -1,7 +1,16 @@
# Slurm Docker Cluster
# Slurm Kubernetes Cluster

This is a multi-container Slurm cluster using Kubernetes. The Slurm cluster Helm chart creates a named volume for persistent storage of MySQL data files. By default, it also installs the
RookNFS Helm chart (also in this repo) to provide shared storage across the Slurm cluster nodes.
A Helm chart and Dockerfile to run a multi-container Slurm cluster on Kubernetes, featuring:

* Control, login, slurmd (worker), slurmdbd and mariadb pods.
* A shared `/home` directory across the Slurm pods, provided by default by an install of RookNFS that supplies a ReadWriteMany (RWX) storage class.
* SSH and HTTPS access to the login pod, including an Open OnDemand web GUI.
* A single slurmd pod per Kubernetes worker node, with automatic definition of Slurm node memory and CPU configuration.
* Slurm jobs run inside the slurmd pods, using host networking for maximum MPI performance.
* Open MPI installed with support for Slurm's `srun` launcher (via `pmix`) - see example below.
* Support for containerised jobs via Apptainer - see example below.
* Job accounting information retained across container upgrades via a persistent volume claim.
* Credentials/secrets are generated during the Helm install, not embedded in images.
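As a minimal sketch of the `srun`/PMIx launch mentioned above (the script name, resource counts and `mpi_hello` binary are placeholders, not part of this chart — see the full examples later in this README):

```shell
#!/bin/bash
#SBATCH --job-name=mpi-hello
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4

# Launch an Open MPI program via Slurm's srun using the PMIx plugin.
# "mpi_hello" is a hypothetical MPI binary; substitute your own.
srun --mpi=pmix ./mpi_hello
```

Submitted with `sbatch` from the login pod, this runs across the slurmd pods using host networking.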

## Dependencies

@@ -34,16 +43,6 @@ All config files in `slurm-cluster-chart/files` will be mounted into the container

## Deploying the Cluster

### Generating Cluster Secrets

On initial deployment ONLY, run
```console
./generate-secrets.sh [<target-namespace>]
```
This generates a set of secrets in the target namespace to be used by the Slurm cluster. If these need to be regenerated, see "Reconfiguring the Cluster".

Be sure to take note of the Open OnDemand credentials; you will need them to access the cluster through a browser.

### Connecting RWX Volume

A ReadWriteMany (RWX) volume is required for shared storage across cluster nodes. By default, the Rook NFS Helm chart is installed as a dependency of the Slurm cluster chart to provide an RWX-capable StorageClass for the required shared volume. If the target Kubernetes cluster has an existing storage class which should be used instead, set `storageClass` in `values.yaml` to the name of that class and disable the RookNFS dependency by setting `rooknfs.enabled = false`. In either case, the storage capacity of the provisioned RWX volume can be configured via `storage.capacity`.
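For example, a `values.yaml` override using an existing storage class might look like the following sketch (the class name `longhorn-rwx` and the `20Gi` capacity are hypothetical values, not chart defaults):

```yaml
# Use an existing RWX-capable storage class instead of Rook NFS.
# "longhorn-rwx" is a placeholder; substitute a class from your cluster.
storageClass: longhorn-rwx
rooknfs:
  enabled: false
storage:
  capacity: 20Gi
```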
@@ -163,10 +162,6 @@ Generally, restarts of `slurmd`, `slurmctld`, `login` and `slurmdbd` will be required

### Changes to secrets

Regenerate secrets by rerunning
```console
./generate-secrets.sh
```
Some secrets are persisted in volumes, so cycling them requires a full teardown and recreation of those volumes and of the pods on which they are mounted. Run
```console
kubectl delete deployment mysql
@@ -178,4 +173,10 @@ and then restart the other dependent deployments to propagate changes:
kubectl rollout restart deployment slurmd slurmctld login slurmdbd
```

# Known Issues
# Limitations and Known Issues
- Only a single cluster should be deployed per Kubernetes namespace.
- Only the `rocky` user is currently supported.

# Acknowledgements

Originally based on https://github.com/giovtorres/slurm-docker-cluster, which defines a docker-compose-based cluster.