The high performance computing (HPC) resources of The Jackson Laboratory represent a shared resource available to all JAX researchers, and in order to keep these resources available to all researchers in a consistent and fair manner, a number of walltime-based queues have been implemented on these resources. These queues allow the Information Technology department the ability to better plan maintenance and schedule upgrade windows on these systems, while providing a more consistent and stable operating environment for JAX HPC users.
The squeue account provides details about many things including the partitions configured currently
squeue
squeue -u <username>
squeue --help
Examples:
squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
643231 gpu RnnTrain zhaoyu PD 0:00 1 (Resources)
643232 gpu RnnTrain zhaoyu PD 0:00 1 (Priority)
639633_[68-135%30] gpu track_gr sheppk PD 0:00 1 (JobArrayTaskLimit)
552775_7337 compute checkBug zhaoyu PD 0:00 1 (launch failed requeued held)
627649 compute build tewher R 1-03:00:18 1 sumner054
627650 compute build tewher R 1-03:00:18 1 sumner054
627651 compute build tewher R 1-03:00:18 1 sumner054
627652 compute build tewher R 1-03:00:18 1 sumner054
627648 compute build tewher R 1-03:00:28 1 sumner054
squeue -u tewher
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
627649 compute build tewher R 1-03:02:14 1 sumner054
627650 compute build tewher R 1-03:02:14 1 sumner054
627651 compute build tewher R 1-03:02:14 1 sumner054
627652 compute build tewher R 1-03:02:14 1 sumner054
627648 compute build tewher R 1-03:02:24 1 sumner054
638914 compute build tewher R 22:07:11 1 sumner014
638981 compute build tewher R 22:07:11 1 sumner014
638982 compute build tewher R 22:07:11 1 sumner039
638984 compute build tewher R 22:07:11 1 sumner054
638985 compute build tewher R 22:07:11 1 sumner014
squeue --help
Usage: squeue [OPTIONS]
-A, --account=account(s) comma separated list of accounts
to view, default is all accounts
-a, --all display jobs in hidden partitions
--array-unique display one unique pending job array
element per line
--federation Report federated information if a member
of one
-h, --noheader no headers on output
--hide do not display jobs in hidden partitions
-i, --iterate=seconds specify an interation period
-j, --job=job(s) comma separated list of jobs IDs
to view, default is all
--local Report information only about jobs on the
local cluster. Overrides --federation.
-l, --long long report
-L, --licenses=(license names) comma separated list of license names to view
-M, --clusters=cluster_name cluster to issue commands to. Default is
current cluster. cluster with no name will
reset to default. Implies --local.
-n, --name=job_name(s) comma separated list of job names to view
--noconvert don't convert units from their original type
(e.g. 2048M won't be converted to 2G).
-o, --format=format format specification
-O, --Format=format format specification
-p, --partition=partition(s) comma separated list of partitions
to view, default is all partitions
-q, --qos=qos(s) comma separated list of qos's
to view, default is all qos's
-R, --reservation=name reservation to view, default is all
-r, --array display one job array element per line
--sibling Report information about all sibling jobs
on a federated cluster. Implies --federation.
-s, --step=step(s) comma separated list of job steps
to view, default is all
-S, --sort=fields comma separated list of fields to sort on
--start print expected start times of pending jobs
-t, --states=states comma separated list of states to view,
default is pending and running,
'--states=all' reports all states
-u, --user=user_name(s) comma separated list of users to view
--name=job_name(s) comma separated list of job names to view
-v, --verbose verbosity level
-V, --version output version information and exit
-w, --nodelist=hostlist list of nodes to view, default is
all nodes
Help options:
--help show this help message
--usage display a brief summary of squeue options
The sacctmgr
(or Slurm Account Manager) command shows the current QOS (aka queues) on the HPC environment.
Examples:
sacctmgr
sacctmgr
sacctmgr show qos
sacctmgr --help
sacctmgr show qos
$ sacctmgr show qos
Name Priority GraceTime Preempt PreemptMode Flags UsageThres UsageFactor GrpTRES GrpTRESMins GrpTRESRunMin GrpJobs GrpSubmit GrpWall MaxTRES MaxTRESPerNode MaxTRESMins MaxWall MaxTRESPU MaxJobsPU MaxSubmitPU MaxTRESPA MaxJobsPA MaxSubmitPA MinTRES
---------- ---------- ---------- ---------- ----------- ---------------------------------------- ---------- ----------- ------------- ------------- ------------- ------- --------- ----------- ------------- -------------- ------------- ----------- ------------- --------- ----------- ------------- --------- ----------- -------------
long 0 00:00:00 cluster 1.000000 14-00:00:00 cpu=3600
batch 0 00:00:00 cluster 1.000000 3-00:00:00 cpu=3600
sacctmgr --help
$ sacctmgr --help
sacctmgr [<OPTION>] [<COMMAND>]
Valid <OPTION> values are:
-h or --help: equivalent to "help" command
-i or --immediate: commit changes immediately
-n or --noheader: no header will be added to the beginning of output
-o or --oneliner: equivalent to "oneliner" command
-p or --parsable: output will be '|' delimited with a '|' at the end
-P or --parsable2: output will be '|' delimited without a '|' at the end
-Q or --quiet: equivalent to "quiet" command
-r or --readonly: equivalent to "readonly" command
-s or --associations: equivalent to "associations" command
-v or --verbose: equivalent to "verbose" command
-V or --version: equivalent to "version" command
<keyword> may be omitted from the execute line and sacctmgr will execute
in interactive mode. It will process commands as entered until explicitly
terminated.
Valid <COMMAND> values are:
add <ENTITY> <SPECS> add entity
archive <DUMP/LOAD> <SPECS>
Archive past jobs and/or steps, or load them
back into the databse.
associations when using show/list will list the
associations associated with the entity.
clear stats clear server statistics
delete <ENTITY> <SPECS> delete the specified entity(s)
dump <CLUSTER> [File=<FILENAME>]
dump database information of the
specified cluster to the flat file.
Will default to clustername.cfg if no file
is given.
exit terminate sacctmgr
help print this description of use.
list <ENTITY> [<SPECS>] display info of identified entity, default
is display all.
load <FILE> [<SPECS>] read in the file to update the database
with the file contents. <SPECS> here consist
of 'cluster=', and 'clean'. The 'cluster='
will override the cluster name given in the
file. The 'clean' option will remove what is
already in the system for this cluster and
replace it with the file. If the clean option
is not given only new additions or
modifications will be done, no deletions.
modify <ENTITY> <SPECS> modify entity
oneliner report output one record per line.
parsable output will be | delimited with an ending '|'
parsable2 output will be | delimited without an ending '|'
quiet print no messages other than error messages.
quit terminate this command.
readonly makes it so no modification can happen.
reconfigure reread the slurmdbd.conf on the DBD.
shutdown shutdown the server.
show same as list
verbose enable detailed logging.
version display tool version number.
!! Repeat the last command entered.
<ENTITY> may be "account", "association", "cluster",
"configuration", "coordinator", "event",
"federation", "job", "problem", "qos",
"resource", "reservation", "runawayjobs", "stats"
"transaction", "tres", "user" or "wckey"
<SPECS> are different for each command entity pair.
list account - Clusters=, Descriptions=, Format=,
Names=, Organizations=, Parents=, WithAssoc,
WithDeleted, WithCoordinators, WithRawQOS,
and WOPLimits
add account - Clusters=, DefaultQOS=, Description=, Fairshare=,
GrpTRESMins=, GrpTRES=, GrpJobs=, GrpMemory=,
GrpNodes=, GrpSubmitJob=, GrpWall=, MaxTRESMins=,
MaxTRES=, MaxJobs=, MaxNodes=, MaxSubmitJobs=,
MaxWall=, Names=, Organization=, Parent=,
and QosLevel=
modify account - (set options) DefaultQOS=, Description=,
Fairshare=, GrpTRESMins=, GrpTRESRunMins=,
GrpTRES=, GrpJobs=, GrpMemory=, GrpNodes=,
GrpSubmitJob=, GrpWall=, MaxTRESMins=, MaxTRES=,
MaxJobs=, MaxNodes=, MaxSubmitJobs=, MaxWall=,
Names=, Organization=, Parent=, and QosLevel=
RawUsage= (with admin privileges only)
(where options) Clusters=, DefaultQOS=,
Descriptions=, Names=, Organizations=,
Parent=, and QosLevel=
delete account - Clusters=, DefaultQOS=, Descriptions=, Names=,
Organizations=, and Parents=
list associations - Accounts=, Clusters=, Format=, ID=, OnlyDefaults,
Partitions=, Parent=, Tree, Users=,
WithSubAccounts, WithDeleted, WOLimits,
WOPInfo, and WOPLimits
list cluster - Classification=, DefaultQOS=, Federation=,
Flags=, Format=, Names=, RPC= WithFed and
WOLimits
add cluster - DefaultQOS=, Fairshare=, Federation=, FedState=,
GrpTRES=, GrpJobs=, GrpMemory=, GrpNodes=,
GrpSubmitJob=, MaxTRESMins=, MaxJobs=,
MaxNodes=, MaxSubmitJobs=, MaxWall=, Name=,
QosLevel= and Weight=
modify cluster - (set options) DefaultQOS=, Fairshare=,
Federation=, FedState=, GrpTRES=, GrpJobs=,
GrpMemory=, GrpNodes=, GrpSubmitJob=,
MaxTRESMins=, MaxJobs=, MaxNodes=,
MaxSubmitJobs=, MaxWall=, QosLevel= and Weight=
(where options) Classification=, Federation=,
Flags=, and Names=
delete cluster - Classification=, DefaultQOS=, Flags=, and Names=
add coordinator - Accounts=, and Names=
delete coordinator - Accounts=, and Names=
list events - All_Clusters, All_Time, Clusters=, End=, Events=,
Format=, MaxCPUs=, MinCPUs=, Nodes=, Reason=,
Start=, States=, and User=
list federation - Names=, Format= and Tree
add federation - Flags=, Clusters= and Name=
modify federation - (set options) Clusters= and Flags=
(where options) Names=
delete federation - Names=
modify job - (set options) DerivedExitCode=, Comment=
(where options) JobID=, Cluster=
list qos - Descriptions=, Format=, Id=, Names=,
PreemptMode=, and WithDeleted
add qos - Description=, Flags=, GraceTime=, GrpJobs=,
GrpSubmitJob=, GrpTRES=, GrpTRESMins=, GrpWall=,
MaxJobs=, MaxSubmitJobsPerUser=, MaxTRESMins=,
MaxTRESPerJob=, MaxTRESPerNode=, MaxTRESPerUser=,
MaxWall=, Names=, Preempt=, PreemptMode=,
Priority=, UsageFactor=, and UsageThreshold=
modify qos - (set options) Description=, Flags=, GraceTime=,
GrpJobs=, GrpSubmitJob=, GrpTRES=, GrpTRESMins=,
GrpWall=,
MaxJobs=, MaxSubmitJobsPerUser=, MaxTRESMins=,
MaxTRESPerJob=, MaxTRESPerNode=, MaxTRESPerUser=,
MaxWall=, Names=, Preempt=, PreemptMode=,
Priority=, RawUsage= (admin only),
UsageFactor=, and UsageThreshold=
(where options) Descriptions=, ID=, Names=
and PreemptMode=
delete qos - Descriptions=, ID=, Names=, and PreemptMode=
list resource - Clusters=, Descriptions=, Flags=, Format=, Ids=,
Names=, PercentAllowed=, ServerType=, Servers=,
and WithClusters
add resource - Clusters=, Count=, Descriptions=, Flags=,
ServerType=, Names=, PercentAllowed=, Server=,
and Type=
modify resource - (set options) Count=, Flags=, Manager=,
PercentAllowed=,
(where options) Clusters=, Names=, Servers=,
delete resource - Clusters=, Names=
list reservation - Clusters=, End=, ID=, Names=, Nodes=, Start=
list runawayjobs - Cluster=, Format=
clear stats
list stats
list transactions - Accounts=, Action=, Actor=, Clusters=, End=,
Format=, ID=, Start=, User=, and WithAssoc
list tres - ID=, Name=, Type=, WithDeleted
list user - AdminLevel=, DefaultAccount=,
DefaultWCKey=, Format=, Names=,
QosLevel=, WithAssoc, WithCoordinators,
WithDeleted, WithRawQOS, and WOPLimits
add user - Accounts=, AdminLevel=, Clusters=,
DefaultAccount=, DefaultQOS=, DefaultWCKey=,
Fairshare=, MaxTRESMins=, MaxTRES=,
MaxJobs=, MaxNodes=, MaxSubmitJobs=, MaxWall=,
Names=, Partitions=, and QosLevel=
modify user - (set options) AdminLevel=, DefaultAccount=,
DefaultQOS=, DefaultWCKey=, Fairshare=,
MaxTRESMins=, MaxTRES=, MaxJobs=, MaxNodes=,
MaxSubmitJobs=, MaxWall=, NewName=,
and QosLevel=,
RawUsage= (with admin privileges only)
(where options) Accounts=, AdminLevel=,
Clusters=, DefaultAccount=, Names=,
Partitions=, and QosLevel=
delete user - Accounts=, AdminLevel=, Clusters=,
DefaultAccount=, DefaultWCKey=, and Names=
list wckey - Clusters=, End=, Format=, ID=, Names=,
Start=, User=, and WithDeleted
archive dump - Directory=, Events, Jobs,
PurgeEventAfter=, PurgeJobAfter=,
PurgeStepAfter=, PurgeSuspendAfter=,
Script=, Steps, and Suspend
archive load - File=, or Insert=
Format options are different for listing each entity pair.
One can get an number of characters by following the field option with
a %NUMBER option. i.e. format=name%30 will print 30 chars of field name.
Account - Account, Coordinators, Description, Organization
Association - Account, Cluster, DefaultQOS, Fairshare,
GrpTRESMins, GrpTRESRunMins, GrpTRES, GrpJobs,
GrpMemory, GrpNodes, GrpSubmitJob, GrpWall,
ID, LFT, MaxTRESMins, MaxTRES,
MaxJobs, MaxNodes, MaxSubmitJobs, MaxWall, QOS,
ParentID, ParentName, Partition, RGT,
User, WithRawQOS
Cluster - Classification, Cluster, ClusterNodes,
ControlHost, ControlPort, DefaultQOS,
Fairshare, Flags, GrpTRESMins, GrpTRES GrpJobs,
GrpMemory, GrpNodes, GrpSubmitJob, MaxTRESMins,
MaxTRES, MaxJobs, MaxNodes, MaxSubmitJobs,
MaxWall, NodeCount, PluginIDSelect, RPC, TRES
Event - Cluster, ClusterNodes, Duration, End,
Event, EventRaw, NodeName, Reason, Start,
State, StateRaw, TRES, User
QOS - Description, Flags, GraceTime, GrpJobs,
GrpSubmitJob, GrpTRES, GrpTRESMins, GrpWall,
MaxJobs, MaxSubmitJobsPerUser, MaxTRESMins,
MaxTRESPerJob, MaxTRESPerNode, MaxTRESPerUser,
MaxWall, Name, Preempt, PreemptMode,
Priority, UsageFactor, UsageThreshold
Resource - Cluster, Count, CountAllowed, CountUsed,
Description, Flags, Manager, Name,
PercentAllowed, PercentUsed, Server, Type
Reservation - Assoc, Cluster, End, Flags, ID, Name,
NodeNames, Start, TRES, UnusedWall
RunAwayJobs - Cluster, ID, Name, Partition, State,
TimeStart, TimeEnd
Transactions - Action, Actor, Info, TimeStamp, Where
TRES - ID, Name, Type
User - AdminLevel, Coordinators, DefaultAccount,
DefaultWCKey, User
WCKey - Cluster, ID, Name, User
Account/User WithAssoc option will also honor
all of the options for Association.
All commands entitys, and options are case-insensitive.
- Currently no restrictions on interactive job count
- Jobs will run as long as the QOS allows
- Up to 700 cores running per user, max walltime = 72 hours.
- This queue allows users to submit a large number of medium-runtime, high-resource jobs to the Jax HPC clusters. This queue does not allow interactivity. Users should use interactivity available in the
interactive
queue. Jobs that do not specify a QOS will not be scheduled to run.
- 504 hours max walltime.
- This queue is where long-runtime jobs should be submitted.
- Jobs requiring greater then 768gb of memory.
- 72 hours max walltime.
- Please request approval to access the 'high_mem' queue via helpdesk. A cluster job ID will be required to confirm the requested access. A limited set of resources dedicated to running jobs that have memory requirements greater than 7 GB.
- Interactivity allowed. Job arrays allowed.
- 72 hours max walltime.
- This queue is for jobs requiring NVIDIA GPU technology. Jobs observed not utilizing GPU resources will be terminated without warning. A limited set of resources dedicated to running jobs that require GPU support.
- The
special
queue is reserved for special requests by researchers in conjunction with the IT department. Users should consult with IT if their jobs require more than 3 weeks to complete or have other needs that the above queues do not address. Output showing the walltime exceeded from your job in thelong
queue or other reasons the job can not function within the existing queue configurations will be required.