Helpful slurm commands

Celeste-Melize Ferrus edited this page Dec 25, 2022 · 5 revisions

Slurm is the workload manager used on many of our compute clusters. Every Slurm command is documented online, but if you don't know which command you need, it can be hard to work out how to perform a task, or even to know that the task is possible.

Basic day-to-day slurm operations (submitting jobs, canceling jobs, etc).

The documentation supplied by each compute cluster is pretty good here (if not, please add to this wiki!).

  • To cancel all of your own jobs, run scancel -u $USER
  • To start a job, run sbatch Submit.sh. If you'd like an email when it finishes or fails, run sbatch --mail-user=$YOUR_EMAIL --mail-type=END,FAIL Submit.sh
  • If you forget what a long-running job was, run scontrol show jobid $JOB_ID, where $JOB_ID is the ID of the job of interest. The output includes the job name, the submitted command, and the working directory.
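
As a minimal sketch, the submit and look-up commands above could be wrapped in two small shell functions. The function names submit_with_mail and job_fields are hypothetical (they are not part of Slurm), and the fields picked out of the scontrol output (JobName, JobState, RunTime, WorkDir, Command) are assumed to be printed by the Slurm version on your cluster.

```shell
# Hypothetical helper: submit a batch script and ask for mail when it ends or fails.
# Usage: submit_with_mail Submit.sh you@example.org
submit_with_mail() {
    # --parsable makes sbatch print only the job ID
    # (followed by ";cluster" on multi-cluster setups).
    sbatch --parsable --mail-user="$2" --mail-type=END,FAIL "$1"
}

# Hypothetical helper: show only a few key=value fields for one job.
# Usage: job_fields 123456
# Caveat: the output is split on spaces, so values that contain spaces
# (for example a Command with arguments) are cut at the first space.
job_fields() {
    scontrol show job "$1" \
        | tr ' ' '\n' \
        | grep -E '^(JobName|JobState|RunTime|WorkDir|Command)='
}
```

For example, jobid=$(submit_with_mail Submit.sh "$YOUR_EMAIL") keeps the job ID in a variable, so that job_fields "$jobid" can be run later without having to look the ID up again.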

Querying user limits on queues (which are called 'partitions' by slurm)

You won't need this every day, but it can be important when debugging Slurm problems.

  • To see basic information for all partitions:
sinfo -s
  • Each partition can have an associated "quality of service" (QOS), which is a named set of user limits for that partition (e.g. the maximum run time, number of nodes, number of cores per node, and amount of memory that a user can request). To see the name of the QOS associated with each partition, along with the partition's other settings, use the following command and look for QoS=<name> in the output, where <name> is the name of the QOS (a partition with no QOS assigned shows QoS=N/A):
scontrol show partition
  • To list the limits associated with every QOS, one row per QOS name, the command is:
sacctmgr show qos
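
The partition-to-QOS lookup described above can be sketched as a small filter over the scontrol output. The function name partition_qos is hypothetical, and the sketch assumes the output uses PartitionName=<partition> and QoS=<name> fields as described in this section.

```shell
# Hypothetical helper: read `scontrol show partition` output on stdin and
# print one "partition qos" pair per partition.
# Usage: scontrol show partition | partition_qos
partition_qos() {
    awk '
        {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^PartitionName=/) {
                    # Remember the partition currently being described.
                    part = substr($i, length("PartitionName=") + 1)
                } else if ($i ~ /^QoS=/) {
                    # Matches the QoS= field only; AllowQos= does not start with "QoS=".
                    print part, substr($i, length("QoS=") + 1)
                }
            }
        }
    '
}
```

Each QOS name printed in the second column can then be found in the output of sacctmgr show qos. sacctmgr also accepts a format= option to select columns (for example format=Name,MaxWall), though which field names are available depends on the Slurm version.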

For more details, these commands can be looked up here
