Merge pull request #21 from stackhpc/hook-race-fix
Hook race fix
wtripp180901 authored Aug 18, 2023
2 parents def4a77 + 89981e6 commit a0a2323
Showing 4 changed files with 47 additions and 3 deletions.
2 changes: 2 additions & 0 deletions README.md
@@ -77,6 +77,8 @@ Subsequent releases can be deployed using:
 helm upgrade <deployment-name> slurm-cluster-chart
 ```

+Note: When updating the cluster with `helm upgrade`, a pre-upgrade hook will prevent upgrades if there are running jobs in the Slurm queue. Attempting to upgrade will set all Slurm nodes to `DRAINED` state. If an upgrade fails due to running jobs, you can undrain the nodes either by waiting for running jobs to complete and then retrying the upgrade or by manually undraining them by accessing the cluster as a privileged user. Alternatively you can bypass the hook by running `helm upgrade` with the `--no-hooks` flag (may result in running jobs being lost)
+
 ## Accessing the Cluster

 Retrieve the external IP address of the login node using:
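
The three recovery paths named in the README note can be laid out as a dry-run sketch. The `run` helper and the `my-release` deployment name are hypothetical and exist only for illustration; the `helm` and `scontrol` command lines themselves are taken from this commit. Nothing is executed, the commands are only printed.

```shell
#!/usr/bin/env bash
# Dry-run sketch of the recovery options described in the README note.
# `run` is a hypothetical helper that only prints the command it is given;
# replace the echo with "$@" to actually execute it.
run() {
    echo "+ $*"
}

# Option 1: wait for running jobs to finish, then retry the upgrade.
retry_upgrade() {
    run helm upgrade "$1" slurm-cluster-chart
}

# Option 2: undrain by hand from a privileged shell on the cluster.
undrain_manually() {
    run scontrol update NodeName=all State=UNDRAIN
}

# Option 3: bypass the hooks entirely (running jobs may be lost).
force_upgrade() {
    run helm upgrade "$1" slurm-cluster-chart --no-hooks
}

retry_upgrade my-release
undrain_manually
force_upgrade my-release
```

With option 1, the nodes are presumably undrained by the post-upgrade hook added in this commit once the retried upgrade succeeds, so no manual `scontrol` call should be needed in that case.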
12 changes: 10 additions & 2 deletions image/docker-entrypoint.sh
@@ -141,15 +141,23 @@ elif [ "$1" = "check-queue-hook" ]
 then
     start_munge

+    scontrol update NodeName=all State=DRAIN Reason="Preventing new jobs running before upgrade"
+
     RUNNING_JOBS=$(squeue --states=RUNNING,COMPLETING,CONFIGURING,RESIZING,SIGNALING,STAGE_OUT,STOPPED,SUSPENDED --noheader --array | wc --lines)

     if [[ $RUNNING_JOBS -eq 0 ]]
     then
-    exit 0
+        exit 0
     else
-    exit 1
+        exit 1
     fi

+elif [ "$1" = "undrain-nodes-hook" ]
+then
+    start_munge
+    scontrol update NodeName=all State=UNDRAIN
+    exit 0
+
 elif [ "$1" = "generate-keys-hook" ]
 then
     mkdir -p ./temphostkeys/etc/ssh
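
The ordering in the `check-queue-hook` branch (drain first, then count active jobs) can be illustrated with a self-contained sketch. Here `scontrol` and `squeue` are replaced by hypothetical stubs (`drain_all_nodes`, `list_active_jobs`) and a made-up `FAKE_QUEUE` variable so that it runs without a Slurm cluster; none of these names appear in the real entrypoint. The reading that draining first closes the window in which a new job could start after the check is an interpretation of the commit title and the `Reason` string, not something the commit states outright.

```shell
#!/usr/bin/env bash
# Sketch of the drain-then-check ordering, with Slurm commands stubbed out.

# Hypothetical stand-in for `scontrol update NodeName=all State=DRAIN ...`.
drain_all_nodes() {
    DRAINED=1
}

# Hypothetical stand-in for `squeue --noheader ...`: prints one line per
# active job. FAKE_QUEUE is an illustrative variable, empty when idle.
list_active_jobs() {
    if [ -n "${FAKE_QUEUE:-}" ]; then
        printf '%s\n' "$FAKE_QUEUE"
    fi
}

check_queue_hook() {
    # Drain before counting, so no new job can start between the check
    # and the upgrade itself.
    drain_all_nodes
    local running_jobs
    running_jobs=$(list_active_jobs | wc -l)
    if [ "$running_jobs" -eq 0 ]; then
        return 0    # queue is empty: the upgrade may proceed
    else
        return 1    # active jobs: the hook fails and nodes stay drained
    fi
}

# Usage: an idle queue lets the hook pass, a busy one makes it fail.
FAKE_QUEUE=""
if check_queue_hook; then echo "idle queue: upgrade allowed"; fi
FAKE_QUEUE="101 RUNNING"
if ! check_queue_hook; then echo "busy queue: upgrade blocked"; fi
```

In the failing case the nodes are left drained, which is why the commit pairs this branch with an `undrain-nodes-hook` that runs after a successful upgrade.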
34 changes: 34 additions & 0 deletions slurm-cluster-chart/templates/undrain-nodes-hook.yaml
@@ -0,0 +1,34 @@
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: undrain-nodes-hook
+  annotations:
+    "helm.sh/hook": post-upgrade
+    "helm.sh/hook-delete-policy": hook-succeeded
+spec:
+  backoffLimit: 0
+  ttlSecondsAfterFinished: 0
+  template:
+    metadata:
+      name: undrain-nodes-hook
+    spec:
+      restartPolicy: Never
+      containers:
+        - name: undrain-nodes-hook
+          image: {{ .Values.slurmImage }}
+          args:
+            - undrain-nodes-hook
+          volumeMounts:
+            - mountPath: /tmp/munge.key
+              name: munge-key-secret
+              subPath: munge.key
+            - mountPath: /etc/slurm/
+              name: slurm-config-volume
+      volumes:
+        - name: munge-key-secret
+          secret:
+            secretName: {{ .Values.secrets.mungeKey }}
+            defaultMode: 0400
+        - name: slurm-config-volume
+          configMap:
+            name: {{ .Values.configmaps.slurmConf }}
2 changes: 1 addition & 1 deletion slurm-cluster-chart/values.yaml
@@ -1,4 +1,4 @@
-slurmImage: ghcr.io/stackhpc/slurm-docker-cluster:d3daba4
+slurmImage: ghcr.io/stackhpc/slurm-docker-cluster:1f51003

 login:
   # Deployment resource name