Helpful slurm commands
Celeste-Melize Ferrus edited this page Dec 25, 2022 · 5 revisions
Slurm is the workload manager used on many of our compute clusters. Details of every Slurm command can be looked up online, but if you don't know which command you are looking for, it can be difficult to figure out how to perform some tasks, or even to know that certain tasks are possible.
The documentation supplied by each compute cluster is pretty good here (if not, please add to this wiki!).
- To cancel all of your own jobs, do
scancel -u $USER
- To start a job, do
sbatch Submit.sh
If you'd like to be notified by email when it finishes or fails, do
sbatch --mail-user=$YOUR_EMAIL --mail-type=END,FAIL Submit.sh
- If you forget the details of a long-running job, do
scontrol show jobid $JOB_ID
where $JOB_ID is the ID of the job of interest.
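Looking a job up later requires its job ID, so it can help to capture the ID at submission time. The following is a minimal sketch, assuming a bash shell and that `sbatch --parsable` prints either `<jobid>` or `<jobid>;<cluster>`. The function names `parse_job_id` and `submit_and_report` are hypothetical helpers for illustration, not part of Slurm.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the names are illustrative, not part of Slurm.

# Extract the job ID from `sbatch --parsable` output, which is assumed
# to be either "<jobid>" or "<jobid>;<cluster>".
parse_job_id() {
    local out="$1"
    # Drop everything from the first ';' onwards.
    printf '%s\n' "${out%%;*}"
}

# Submit a script with email notification on END and FAIL,
# then print the job ID so it can be stored by the caller.
# Usage: submit_and_report Submit.sh you@example.org
submit_and_report() {
    local script="$1" email="$2" out
    out=$(sbatch --parsable --mail-user="$email" --mail-type=END,FAIL "$script") || return 1
    parse_job_id "$out"
}
```

With these defined, something like `JOB_ID=$(submit_and_report Submit.sh "$YOUR_EMAIL")` followed by `scontrol show jobid "$JOB_ID"` should show the details of the job that was just submitted.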
The following partition and QOS commands are not something you will use every day, but they can be important when debugging Slurm problems.
- To see basic information for all partitions:
sinfo -s
- Each partition has an associated "quality of service" (QOS), which is a set of user limits on that partition (e.g. the run time, number of nodes, number of cores per node, and amount of memory that a user can request). To see the name of the QOS associated with each partition (plus other information about that partition), use the following command and look for QoS=<name> in the output, where <name> is the name of the QOS:
scontrol show partition
- If you want to find the limits associated with every QOS, listed by the name of each QOS, the command is:
sacctmgr show qos
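Scanning the full `scontrol show partition` output for the QoS field can be tedious on clusters with many partitions. The sketch below is one possible way to reduce it to one line per partition. It assumes the output consists of whitespace-separated `Key=Value` fields, with `PartitionName=` appearing before `QoS=` in each record. The function name `partition_qos` is a hypothetical helper, not a Slurm command.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `scontrol show partition` output on stdin
# and prints "<partition> <qos>" for each partition record.
partition_qos() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^PartitionName=/) {
                # Remember the partition this record belongs to.
                name = substr($i, index($i, "=") + 1)
            } else if ($i ~ /^QoS=/) {
                # Anchored match, so fields like AllowQos= are skipped.
                print name, substr($i, index($i, "=") + 1)
            }
        }
    }'
}
```

Usage would be `scontrol show partition | partition_qos`. The QOS names it prints can then be matched against the Name column of `sacctmgr show qos` to find the corresponding limits.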
For more details, look up these commands in the Slurm documentation.