Merge pull request #20 from ucdavis/johnmc-changes

Added general troubleshooting page

camillescottatwork authored Jun 26, 2024
2 parents bef27db + 0c58fd3 commit 386a925

Showing 3 changed files with 147 additions and 4 deletions.
76 changes: 76 additions & 0 deletions docs/general/troubleshooting.md
@@ -0,0 +1,76 @@
## Common SSH Issues

Here are some of the most common issues users face when using SSH.

### Keys

The following clusters use SSH keys: Atomate, Farm, Franklin, HPC1, HPC2, Impact, Peloton.

If you connect to one of these and are asked for a password (as distinct from a passphrase for your key),
your key is not being recognized. This is usually because of permissions or an unexpected filename.
SSH looks for your key under a specific set of default filenames. Unless you specified something else
when generating the key, it is probably `$HOME/.ssh/id_rsa`.

If you chose a different name when generating your key, you can point SSH at it with the `-i` flag:

```bash
ssh -i $HOME/.ssh/newkey [USER]@[cluster].hpc.ucdavis.edu
```
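
If you don't want to pass `-i` on every connection, another option is an entry in `$HOME/.ssh/config`. A minimal sketch, assuming the same key path as above (the `Host` alias is just a placeholder for whichever cluster you connect to):

```
Host mycluster
    HostName [cluster].hpc.ucdavis.edu
    User [USER]
    IdentityFile ~/.ssh/newkey
```

With an entry like this, `ssh mycluster` will use the named key automatically.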

If you kept the default name, the key's permissions should be set so that only you can read and write it (`-rw-------`, or `600`).
To ensure this is the case, you can do the following:

```bash
chmod 600 $HOME/.ssh/id_rsa
```
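
If you aren't sure what the current permissions are, you can check them first; something like the following should work (the `.ssh` directory itself should also be restricted to you):

```bash
ls -ld $HOME/.ssh $HOME/.ssh/id_rsa   # check current permissions
chmod 700 $HOME/.ssh                  # directory: only you can enter and list it
chmod 600 $HOME/.ssh/id_rsa           # private key: only you can read and write it
```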

On HPC2, your public key is kept in `$HOME/.ssh/authorized_keys`. Please make sure not to remove your key from this file;
doing so will cause you to lose access.

SSH keys will not work for LSSC0 or the Genome Center login nodes, but there is another method for
logging in without a password.

To enable logins without a password, you will need to enable GSSAPI, which
some systems enable by default. If not enabled, add the following to your
`$HOME/.ssh/config` file (create it if it doesn't exist):

```
GSSAPIAuthentication yes
GSSAPIDelegateCredentials yes
```

The `-K` command line switch to ssh does the same thing on a one-time
basis.
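
For a one-off connection, that looks like this (the hostname is a placeholder for whichever login node you use):

```bash
ssh -K [USER]@[login-node]
```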

Once you have `GSSAPI` enabled, you can get a Kerberos ticket using

```bash
kinit [USER]@GENOMECENTER.UCDAVIS.EDU
```

SSH will use that ticket while it's valid.
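
You can check whether you currently hold a valid ticket, and when it expires, with `klist`:

```bash
klist   # lists your current Kerberos tickets and their expiration times
```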

## Common Slurm Scheduler Issues

These are the most common issues with job scheduling using Slurm.

### Using a non-default account

If you have access to more than one Slurm account and wish to use an account other than your default,
use the `-A` or `--account` flag.

For example, if your default account is `foogrp` and you wish to use `bargrp`:
```bash
srun -A bargrp -t 1:00:00 --mem=20GB scriptname.sh
```
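
The same flag works with `sbatch`. As a sketch, in a batch script the account can also be set in the `#SBATCH` header (the time and memory values are just carried over from the example above):

```bash
#!/bin/bash
#SBATCH --account=bargrp
#SBATCH --time=1:00:00
#SBATCH --mem=20GB

./scriptname.sh
```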

### No default account

Newer Slurm accounts have no default specified, and in this case you might get an error message like:

```
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
```

You will need to specify the account explicitly, as explained [above](#using-a-non-default-account).
You can find out how to view your Slurm account information in the [resources
section](../scheduler/resources.md).
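
As a quick check, one way to list the account names you can submit under is:

```bash
sacctmgr show assoc user=$USER format=account%20
```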

74 changes: 70 additions & 4 deletions docs/scheduler/resources.md
@@ -1,5 +1,71 @@
# Requesting Resources


## Partitions

Each **node** -- a physically distinct machine within the cluster -- will be a member of one or more
**partitions**. A partition consists of a collection of nodes, a policy for job scheduling on that
partition, a policy for resolving conflicts when nodes are members of more than one partition (i.e.,
preemption), and a policy, referred to as Quality of Service (QOS), for managing and restricting
resources per user or per group.
The Slurm documentation has detailed information on how [preemption](https://slurm.schedmd.com/preempt.html) and [QOS
definitions](https://slurm.schedmd.com/qos.html) are handled; our per-cluster _Resources_ sections
describe how partitions are organized and preemption handled on our clusters.
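
To see which partitions exist on the cluster you are logged into, along with their time limits and sizes, something like the following should work:

```bash
sinfo --format="%P %l %D %N"   # partition, time limit, node count, node list
```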

## Accounts

Users are granted access to resources via Slurm **associations**. An association links together a
**user** with an **account** and a QOS definition. **Accounts** most commonly correspond to your
lab, but sometimes exist for graduate groups, departments, or institutes.

To see your associations, and thus which accounts and partitions you have access to, you can use the
`sacctmgr` command:

``` console
$ sacctmgr show assoc user=$USER
Cluster Account User Partition Share ... MaxTRESMins QOS Def QOS GrpTRESRunMin
---------- ---------- ---------- ---------- --------- ... ------------- -------------------- --------- -------------
franklin hpccfgrp camw mmgdept-g+ 1 ... hpccfgrp-mmgdept-gp+
franklin hpccfgrp camw mmaldogrp+ 1 ... hpccfgrp-mmaldogrp-+
franklin hpccfgrp camw cashjngrp+ 1 ... hpccfgrp-cashjngrp-+
franklin hpccfgrp camw jalettsgr+ 1 ... hpccfgrp-jalettsgrp+
franklin hpccfgrp camw jawdatgrp+ 1 ... hpccfgrp-jawdatgrp-+
franklin hpccfgrp camw low 1 ... hpccfgrp-low-qos
franklin hpccfgrp camw high 1 ... hpccfgrp-high-qos
franklin jawdatgrp camw low 1 ... mcbdept-low-qos
franklin jawdatgrp camw jawdatgrp+ 1 ... jawdatgrp-jawdatgrp+
franklin jalettsgrp camw jalettsgr+ 1 ... jalettsgrp-jalettsg+
franklin jalettsgrp camw low 1 ... mcbdept-low-qos
```

The output is very wide, so you may want to pipe it through `less` to make it more readable:

``` console
$ sacctmgr show assoc user=$USER | less -S
```

Or, perhaps preferably, output it in a more compact format:

``` console
$ sacctmgr show assoc user=camw format="account%20,partition%20,qos%40"
Account Partition QOS
-------------------- -------------------- ----------------------------------------
hpccfgrp mmgdept-gpu hpccfgrp-mmgdept-gpu-qos
hpccfgrp mmaldogrp-gpu hpccfgrp-mmaldogrp-gpu-qos
hpccfgrp cashjngrp-gpu hpccfgrp-cashjngrp-gpu-qos
hpccfgrp jalettsgrp-gpu hpccfgrp-jalettsgrp-gpu-qos
hpccfgrp jawdatgrp-gpu hpccfgrp-jawdatgrp-gpu-qos
hpccfgrp low hpccfgrp-low-qos
hpccfgrp high hpccfgrp-high-qos
jawdatgrp low mcbdept-low-qos
jawdatgrp jawdatgrp-gpu jawdatgrp-jawdatgrp-gpu-qos
jalettsgrp jalettsgrp-gpu jalettsgrp-jalettsgrp-gpu-qos
jalettsgrp low mcbdept-low-qos
```

In the above example, we can see that user `camw` has access to the `high` partition via an
association with `hpccfgrp` and the `jalettsgrp-gpu` partition via the `jalettsgrp` account.
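
To use one of those associations, pass the account and partition together. A sketch, with placeholder time and memory values:

```bash
srun -A hpccfgrp -p high -t 1:00:00 --mem=20GB scriptname.sh
```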

## Resource Types

### CPUs / cores
@@ -14,8 +80,8 @@ Slurm's CPU management methods are complex and can quickly become confusing.
For the purposes of this documentation, we will provide a simplified explanation; those with advanced needs
should consult [the Slurm documentation](https://slurm.schedmd.com/cpu_management.html).

Slurm follows a distinction between its physical resources -- cluster nodes and CPUs or cores on a node -- and virtual
resources, or **tasks**, which specify how requested physical resources will be grouped and distributed.
By default, Slurm will minimize the number of nodes allocated to a job, and attempt to keep the job's CPU requests
localized within a node.
**Tasks** group together CPUs (or other resources): CPUs within a task will be kept together on the same node.
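
For example, a request like the following asks for two tasks with four CPUs each; each task's four CPUs are guaranteed to land together on a single node (the time and memory values are placeholders):

```bash
srun --ntasks=2 --cpus-per-task=4 -t 0:30:00 --mem=4GB scriptname.sh
```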
@@ -116,7 +182,7 @@ In our prior examples, however, we used small resource requests.
What happens when we want to distribute jobs across nodes?

Slurm uses the [block distribution](https://slurm.schedmd.com/sbatch.html#OPT_block) method by default to distribute
tasks between nodes.
It will exhaust all the CPUs on a node with task groups before moving to a new node.
For these examples, we're going to create a script that reports both the hostname (i.e., the node) and the number
of CPUs:
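
The script itself is collapsed in this diff view; a minimal sketch of such a script, using the `SLURM_CPUS_ON_NODE` variable that Slurm sets inside an allocation, might look like:

```bash
#!/bin/bash
# Report which node this task landed on and how many CPUs Slurm allocated there.
echo "host: $(hostname)  cpus: $SLURM_CPUS_ON_NODE"
```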
@@ -210,4 +276,4 @@ srun: launch/slurm: _step_signal: Terminating StepId=706.0
```


### GPUs / GRES
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -19,6 +19,7 @@ nav:
- R and RStudio: software/rlang.md
- Development: software/developing.md
- Data Transfer: data-transfer.md
- Troubleshooting: general/troubleshooting.md
- Clusters:
- Farm:
- About: farm/index.md
