Skip to content

JHU-CLSP/rockfish-tutorial

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

20 Commits
 
 
 
 
 
 

Repository files navigation

Rockfish Tutorial

This is the repo for the tutorial of runing tasks on Rockfish. Rockfish is a community-shared cluster, i.e., everyone can use any nodes within the limits of their allocated utilization. Every quarter you request a certain amount of GPU hours (i.e., number of cores x wall time) and provide justification for them. These assigned limits will be reset on a quarterly basis (use it or lose it).

To start, create an account here with your JHED.

IMPORTANT:

  • It is important for you to use JHED otherwise the system will have difficulty authenticating you.
  • If you're a student, you need to get access through your PI (usually you advisor).

Repository Content

  • test_gpus shows how to run a simple script on the GPU processor. You can run sbatch test_gpus.sh to submit the job and make sure that you have access to GPU nodes.
  • classification-example shows how to run a simple classification task on the GPU processor. You can run sbatch train_on_rockfish.sh to train a classifier on Rockfish.

FAQ

  • How do I find what the queue of a partition?
    • Type the command sinfo -s to get a list of the partitions.
    • sinfo -p partition-name will display the utilization for this partition.
  • How do I know what nodes are available?
    • Type the command sinfo -N to get a list of the nodes.
    • sinfo -N -p partition-name will display the utilization for this partition.
  • How do I interpret "states"?
    • idle: The node is available for use.
    • alloc: The node is currently being used by a job.
    • mix: Some of the node's processors are currently being used by a job.
    • Further details here: https://slurm.schedmd.com/sinfo.html
  • How do I see how many GPUs are available on each node?
    • Type the command sinfo -N -p a100 to get a list of the nodes and the number of GPUs available on each node.
  • How do I submit multiple jobs with this different parameters?
  • How do I submit a job to a specific node?
    • You can submit a job to a specific node by using the --nodelist flag.
    • For example, sbatch --nodelist=node1 job.sh will submit the job to node1.
    • You can also use the --exclude flag to exclude a node.
    • For example, sbatch --exclude=node1 job.sh will submit the job to any node except node1.
  • How do I create interactive session?
    • You can create an interactive session by using the salloc command.
    • For example, salloc -p a100 --gres=gpu:1 --time=00:30:00 will create an interactive session with 1 GPU for 30 minutes.
    • You can also interact command which internally makes call to salloc command.
  • How do I see the queue of my jobs?
    • You can see the queue of your jobs by using the squeue command.
    • You can specialize it for a partition: squeue -p a100
    • For example, squeue -u userid will show the queue of your jobs.
  • What do status labels mean in the output of squeue?
    • PD: Pending. The job is awaiting resource allocation.
    • R: Running. The job currently has an allocation.
    • CG: Completing. The job is in the process of completing. Some processes on some nodes may still be active.
    • CD: Completed. The job has terminated all processes on all nodes.
    • F: Failed. The job terminated with non-zero exit code or other failure condition.
    • TO: Timeout. The job terminated upon reaching its time limit.
    • NF: Node Failure. The job terminated due to failure of one or more allocated nodes.
    • CA: Canceled. The job was explicitly canceled by the user or system administrator. The job may or may not have been initiated.
    • SE: Special Exit. The job was requeued in a special state by the scheduler; it may or may not have been initiated.
    • ST: Suspended. The job has an allocation, but execution has been suspended and CPUs have been released for other jobs.
    • S: Suspended by user. The job has an allocation, but execution has been suspended and CPUs have been released for other jobs at the request of the user.
    • PR: Preempted. The job was preempted.
  • How do I cancel a job?
    • You can cancel a job by using the scancel command.
    • For example, scancel jobid will cancel the job with the given jobid.
  • How do I check the statistics of a finished job?
    • You can check the statistics of a finished job by using the sacct command.
    • For example, sacct -j jobid will show the statistics of the job with the given jobid.

Additional Resources

If you struggle with SLURM, these are useful cheatsheets:

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 83.1%
  • Shell 16.9%