Commit

Merge branch 'main' into ood-shell-timeout-fix
wtripp180901 committed Aug 18, 2023
2 parents 68d1cf9 + def4a77 commit d418d3b
Showing 47 changed files with 584 additions and 268 deletions.
6 changes: 3 additions & 3 deletions .github/workflows/build-containers.yml
@@ -3,9 +3,8 @@ on:
push:
paths:
- .github/workflows/build-containers.yml
- Dockerfile
- docker-entrypoint.sh
workflow_dispatch:
- image/**
workflow_dispatch:

jobs:
build_push_api:
@@ -49,6 +48,7 @@ jobs:
with:
provenance: false
push: true
context: image/
tags: ${{ steps.image-meta.outputs.tags }}
labels: ${{ steps.image-meta.outputs.labels }}
cache-from: type=local,src=/tmp/.buildx-cache
47 changes: 18 additions & 29 deletions .github/workflows/publish-helm-chart.yml
@@ -1,37 +1,26 @@
name: Release Charts

on:
push:
branches:
- master

name: Publish charts
# Run the tasks on every push
on: push
jobs:
release:
# depending on default permission settings for your org (contents being read-only or read-write for workloads), you will have to add permissions
# see: https://docs.github.com/en/actions/security-guides/automatic-token-authentication#modifying-the-permissions-for-the-github_token
permissions:
contents: write
publish_charts:
name: Build and push Helm charts
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v3
- name: Check out the repository
uses: actions/checkout@v2
with:
# This is important for the semver action to work correctly
# when determining the number of commits since the last tag
fetch-depth: 0
submodules: true

- name: Configure Git
run: |
git config user.name "$GITHUB_ACTOR"
git config user.email "$GITHUB_ACTOR@users.noreply.github.com"
- name: Install Helm
uses: azure/setup-helm@v3
env:
GITHUB_TOKEN: "${{ secrets.GITHUB_TOKEN }}"
- name: Get SemVer version for current commit
id: semver
uses: stackhpc/github-actions/semver@master

- name: Run chart-releaser
uses: helm/chart-releaser-action@v1.5.0
- name: Publish Helm charts
uses: stackhpc/github-actions/helm-publish@master
with:
charts_dir: .
env:
CR_TOKEN: "${{ secrets.GITHUB_TOKEN }}"

token: ${{ secrets.GITHUB_TOKEN }}
version: ${{ steps.semver.outputs.version }}
app-version: ${{ steps.semver.outputs.short-sha }}
3 changes: 3 additions & 0 deletions .gitignore
@@ -0,0 +1,3 @@
# Build artifacts from local helm install
slurm-cluster-chart/Chart.lock
slurm-cluster-chart/charts/
57 changes: 40 additions & 17 deletions README.md
@@ -1,8 +1,7 @@
# Slurm Docker Cluster

This is a multi-container Slurm cluster using Kubernetes. The Helm chart
creates a named volume for persistent storage of MySQL data files as well as
an NFS volume for shared storage.
This is a multi-container Slurm cluster using Kubernetes. The Slurm cluster Helm chart creates a named volume for persistent storage of MySQL data files. By default, it also installs the
RookNFS Helm chart (also in this repo) to provide shared storage across the Slurm cluster nodes.

## Dependencies

@@ -27,47 +26,51 @@ The Helm chart will create the following named volumes:

* var_lib_mysql ( -> /var/lib/mysql )

A named ReadWriteMany (RWX) volume mounted to `/home` is also expected, this can be external or can be deployed using the scripts in the `/nfs` directory (See "Deploying the Cluster")
A named ReadWriteMany (RWX) volume mounted to `/home` is also expected; this can be external or can be deployed using the provided `rooknfs` chart directory (see "Deploying the Cluster").

## Configuring the Cluster

All config files in `slurm-cluster-chart/files` will be mounted into the container to configure their respective services on startup. Note that changes to these files will not all be propagated to existing deployments (see "Reconfiguring the Cluster").
Additional parameters can be found in the `values.yaml` file, which will be applied on a Helm chart deployment. Note that some of these values will also not propagate until the cluster is restarted (see "Reconfiguring the Cluster").
All config files in `slurm-cluster-chart/files` will be mounted into the container to configure their respective services on startup. Note that changes to these files will not all be propagated to existing deployments (see "Reconfiguring the Cluster"). Additional parameters can be found in the `values.yaml` file for the Helm chart. Note that some of these values will also not propagate until the cluster is restarted (see "Reconfiguring the Cluster").

## Deploying the Cluster

### Generating Cluster Secrets

On initial deployment ONLY, run
```console
./generate-secrets.sh
./generate-secrets.sh [<target-namespace>]
```
This generates a set of secrets. If these need to be regenerated, see "Reconfiguring the Cluster"
This generates a set of secrets in the target namespace to be used by the Slurm cluster. If these need to be regenerated, see "Reconfiguring the Cluster".

Be sure to take note of the Open OnDemand credentials; you will need them to access the cluster through a browser.

### Connecting RWX Volume

A ReadWriteMany (RWX) volume is required, if a named volume exists, set `nfs.claimName` in the `values.yaml` file to its name. If not, manifests to deploy a Rook NFS volume are provided in the `/nfs` directory. You can deploy this by running
```console
/nfs/deploy-nfs.sh
```
and leaving `nfs.claimName` as the provided value.
A ReadWriteMany (RWX) volume is required for shared storage across cluster nodes. By default, the Rook NFS Helm chart is installed as a dependency of the Slurm cluster chart to provide an RWX-capable storage class for the required shared volume. If the target Kubernetes cluster has an existing storage class that should be used instead, set `storageClass` in `values.yaml` to the name of that class and disable the RookNFS dependency by setting `rooknfs.enabled` to `false`. In either case, the storage capacity of the provisioned RWX volume can be configured by setting `storage.capacity`.

See the separate RookNFS chart [values.yaml](./rooknfs/values.yaml) for further configuration options when using RookNFS to provide the shared storage volume.
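
As an illustration of the existing-storage-class case, the same values can be passed on the command line with Helm's `--set` flag instead of editing `values.yaml`. This is a minimal sketch, not part of the repository: the release name, storage class name and capacity are placeholders, and the command is only printed unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Placeholder values (assumptions, not taken from this repository).
RELEASE="my-slurm"
STORAGE_CLASS="existing-rwx-class"
CAPACITY="20Gi"

# Print commands by default; set DRY_RUN=0 to run them against a real cluster.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "+ $*"
  else
    "$@"
  fi
}

# Disable the RookNFS dependency and point the chart at an existing
# RWX-capable storage class, using the value names from values.yaml.
install_with_existing_class() {
  run helm install "$RELEASE" slurm-cluster-chart \
    --set rooknfs.enabled=false \
    --set storageClass="$STORAGE_CLASS" \
    --set storage.capacity="$CAPACITY"
}

install_with_existing_class
```

Keeping the overrides in `values.yaml` is usually easier to review; `--set` is convenient for one-off experiments.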

### Supplying Public Keys

To access the cluster via `ssh`, you will need to make your public keys available. All your public keys from localhost can be added by running

```console
./publish-keys.sh
./publish-keys.sh [<target-namespace>]
```
where `<target-namespace>` is the namespace in which the Slurm cluster chart will be deployed (i.e. using `helm install -n <target-namespace> ...`). This will create a Kubernetes Secret in the appropriate namespace for the Slurm cluster to use. Omitting the namespace argument will install the secrets in the default namespace.

### Deploying with Helm

After configuring `kubectl` with the appropriate `kubeconfig` file, deploy the cluster using the Helm chart:
```console
helm install <deployment-name> slurm-cluster-chart
```

NOTE: If using the RookNFS dependency, the following must be run before installing the Slurm cluster chart:
```console
helm dependency update slurm-cluster-chart
```
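Putting the steps of this section together, a deployment into a dedicated namespace might look like the sketch below. The namespace and release name are hypothetical, the namespace is assumed to exist already, and secret generation (see "Generating Cluster Secrets") is assumed to have been done for the same namespace. Commands are printed rather than executed unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Hypothetical names; replace with your own.
NAMESPACE="slurm"
RELEASE="my-slurm"

# Print commands by default; set DRY_RUN=0 to run them against a real cluster.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "+ $*"
  else
    "$@"
  fi
}

deploy() {
  # Publish local public keys as a Secret in the target namespace.
  run ./publish-keys.sh "$NAMESPACE"
  # Fetch chart dependencies (needed when the RookNFS dependency is used).
  run helm dependency update slurm-cluster-chart
  # Install the chart into the namespace that holds the secrets.
  run helm install -n "$NAMESPACE" "$RELEASE" slurm-cluster-chart
}

deploy
```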

Subsequent releases can be deployed using:

```console
@@ -130,15 +133,33 @@ srun singularity exec docker://ghcr.io/stackhpc/mpitests-container:${MPI_CONTAIN
```

Note: The mpirun script assumes you are running as user 'rocky'. If you are running as root, you will need to include the --allow-run-as-root argument

## Reconfiguring the Cluster

### Changes to config files

To guarantee changes to config files are propagated to the cluster, use
Changes to the Slurm configuration in `slurm-cluster-chart/files/slurm.conf` will be propagated (it may take a few seconds) to `/etc/slurm/slurm.conf` for all pods except the `slurmdbd` pod by running

```console
kubectl rollout restart deployment <deployment-names>
helm upgrade <deployment-name> slurm-cluster-chart/
```
Generally restarts to `slurmd`, `slurmctld`, `login` and `slurmdbd` will be required

The new Slurm configuration can then be read by running `scontrol reconfigure` as root inside a Slurm pod. The [slurm.conf documentation](https://slurm.schedmd.com/slurm.conf.html) notes that some changes require a restart of all daemons, which here requires redeploying the Slurm pods as described below.

Changes to other configuration files (e.g. the Munge key) require a redeploy of the appropriate pods.

To redeploy pods use:
```console
kubectl rollout restart deployment <deployment-names ...>
```
for the `slurmdbd`, `login` and `mysql` pods and

```console
kubectl rollout restart statefulset <statefulset-names ...>
```
for the `slurmd` and `slurmctld` pods

Generally restarts to `slurmd`, `slurmctld`, `login` and `slurmdbd` will be required.
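
The two paths described above (reload the Slurm configuration in place, or redeploy the pods) can be sketched as follows. Several names here are assumptions rather than facts from the chart: the release name is a placeholder, the Deployment and StatefulSet names are assumed to match the pod names listed above, `slurmctld-0` is the pod name a StatefulSet called `slurmctld` would produce, and the container's default user is assumed to be root. Check the real names with `kubectl get pods,deployments,statefulsets`. Commands are printed rather than executed unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Hypothetical names; verify against your cluster before running for real.
RELEASE="my-slurm"
CTLD_POD="slurmctld-0"

# Print commands by default; set DRY_RUN=0 to run them against a real cluster.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "+ $*"
  else
    "$@"
  fi
}

# Path 1: push an edited slurm.conf and ask the Slurm daemons to re-read it.
reconfigure() {
  run helm upgrade "$RELEASE" slurm-cluster-chart/
  run kubectl exec "$CTLD_POD" -- scontrol reconfigure
}

# Path 2: restart the pods when a change cannot be picked up in place.
redeploy() {
  run kubectl rollout restart deployment slurmdbd login mysql
  run kubectl rollout restart statefulset slurmd slurmctld
}

reconfigure
```

Call `redeploy` instead of (or after) `reconfigure` when the slurm.conf documentation says the changed option needs a daemon restart, or when a file other than `slurm.conf` was changed.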

### Changes to secrets

@@ -156,3 +177,5 @@ and then restart the other dependent deployments to propagate changes:
```console
kubectl rollout restart deployment slurmd slurmctld login slurmdbd
```

# Known Issues
35 changes: 0 additions & 35 deletions generate-secrets.sh

This file was deleted.

3 changes: 3 additions & 0 deletions Dockerfile → image/Dockerfile
@@ -9,6 +9,8 @@ LABEL org.opencontainers.image.source="https://github.com/stackhpc/slurm-docker-
ARG SLURM_TAG=slurm-23.02
ARG GOSU_VERSION=1.11

COPY kubernetes.repo /etc/yum.repos.d/kubernetes.repo

RUN set -ex \
&& yum makecache \
&& yum -y update \
@@ -46,6 +48,7 @@ RUN set -ex \
openssh-server \
apptainer \
ondemand \
kubectl \
&& yum clean all \
&& rm -rf /var/cache/yum
