Added HBM info for Young. Updated quota info for Young and Michael.
heatherkellyucl committed Oct 31, 2024
1 parent f4b0025 commit 985b8e8
Showing 2 changed files with 52 additions and 11 deletions.
5 changes: 4 additions & 1 deletion mkdocs-project-dir/docs/Clusters/Michael.md
@@ -360,7 +360,10 @@ If your job must run within a single CU, you can request the parallel environment

## Disk quotas

You have one per-user quota, with a default amount of 250GB - this is the total across home and Scratch.
You have per-user quotas for home and Scratch.

- home: 100G quota, backed up, no increases available
- Scratch: 250G quota by default, not backed up, increases possible

- `lquota` shows you your quota and total usage (twice).
- `request_quota` is how you request a quota increase.
58 changes: 48 additions & 10 deletions mkdocs-project-dir/docs/Clusters/Young.md
@@ -18,7 +18,7 @@ In particular the "Software Training" section:
details of hardware, how to submit jobs and an overview of types of parallelism
- [A quick 4 minute overview of how to choose memory](https://www.youtube.com/watch?v=pYspFuxbWjs)
- [Longer videos on memory allocation for new users](https://www.youtube.com/watch?v=D3iF_N0cwv4)
- [A hands-on memory allocation example](https://www.youtube.com/watch?v=QlMaN_ECIwg)

### MMM Hub: HPE / NVIDIA GPU Training Day

@@ -325,6 +325,7 @@ facilities.
| Free CPU job, any | 5120 | 0 | 24hrs |
| Free GPU job, any | 320 | 40 | 48hrs |
| Free GPU fast interactive | 64 | 8 | 6hrs |
| HBM CPU job, any | 2048 | 0 | 48hrs |

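As an illustration of how these limits translate into a job request, here is a minimal jobscript header for a free CPU job. This is a sketch only: the core count, memory and runtime are placeholder values, and it assumes the standard Grid Engine requests (`-l h_rt`, `-l mem`, `-pe mpi`) used on these clusters, with `gerun` as the MPI launcher (swap in your usual launcher if different).

```
#!/bin/bash -l
# Sketch of a free CPU job sized within the limits above (placeholder values).
# Wallclock: free CPU jobs can run for up to 24 hours.
#$ -l h_rt=24:00:00
# Memory per core.
#$ -l mem=4G
# 80 cores = two standard 40-core nodes, well inside the 5120-core limit.
#$ -pe mpi 80
# Run from the directory the job was submitted from.
#$ -cwd
gerun ./my_mpi_program
```
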
CPU jobs or [GPU jobs](#GPU_nodes) can be run on Young, and there are
different [nodes](#Node_types) dedicated for each.
@@ -362,15 +363,17 @@ If your job must run within a single CU, you can request the parallel environment

## Node types

Young has four types of node: standard nodes, big memory nodes, really big memory nodes
and GPU nodes. Note those last two have different processors and number of CPU cores per node.
Young has five types of node: standard nodes, big memory nodes, really big memory nodes,
GPU nodes and HBM nodes. Note that the last three have different processors and numbers of
CPU cores per node.

| Type | Cores per node | RAM per node | tmpfs | Nodes | Memory request necessary | GPU |
| ----- | -------------- | ------------ | ----- | ----- | ------------------------ | --- |
| C | 40 | 192G | None | 576 | Any | None |
| C | 40 | 192G (188G usable) | None | 576 | Any | None |
| Y | 40 | 1.5T | None | 3 | mpi: mem >=19G, smp: >186G total | None |
| Z | 36 | 3.0T | None | 3 | mpi: mem >=42G, smp: >1530G total | None |
| X | 64 | 1T | 200G | 6 | Any | 8 x Nvidia 40G A100 |
| W | 64 | (503G usable) | 3.5T | 32 | Any | None |

These are numbers of physical cores: multiply by two for virtual cores with
hyperthreading.
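
As an example of how the "Memory request necessary" column works, an MPI job whose per-core `-l mem` request is at or above the 19G threshold should be placed on the big memory (Y) nodes rather than the standard nodes. A sketch, with placeholder values:

```
# Sketch: a per-core memory request of 20G (>= the 19G threshold above) steers
# an MPI job onto the 1.5T big memory nodes instead of the 192G standard nodes.
#$ -pe mpi 80
#$ -l mem=20G
#$ -l h_rt=12:00:00
```
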
@@ -386,6 +389,7 @@ Here are the processors each node type has:
- Y: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
- Z: Intel(R) Xeon(R) Gold 6240M CPU @ 2.60GHz
- X: dual AMD EPYC 7543 32-Core Processor
- W: Intel(R) Xeon(R) Gold Max

(If you ever need to check this, you can include `cat /proc/cpuinfo` in your jobscript so
you get it in your job's .o file for the exact node your job ran on. You will get an entry
@@ -398,7 +402,37 @@ future.

[How to use the GPU nodes](../Supplementary/Young_GPU_Nodes.md).
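
For quick reference, a GPU job request on the X nodes looks roughly like the snippet below. Treat it as a sketch only: the `-l gpu=` count and the other values are illustrative, and the page linked above is the authoritative source for the supported options.

```
# Sketch only: request two of the A100 GPUs on a GPU (X) node.
# See the Young GPU nodes page linked above for the full set of options.
#$ -l gpu=2
#$ -l h_rt=12:00:00
```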

### Restricting to one node type

## High Bandwidth Memory nodes

The HBM nodes have 64GB of integrated High Bandwidth Memory per socket, with two sockets per node (128GB of HBM per node in total).

HBM nodes can be configured at the system level in one of two modes.

* HBM cache mode: In cache mode, HBM functions as a memory-side cache for contents of DDR memory.
In this mode, HBM is transparent to all software because the HBM cache is managed by hardware
memory controllers. No code changes are required.

* HBM flat mode: In flat mode, both DDR and the HBM address spaces are visible to software.
Applications may need to be modified or tuned to be aware of the additional memory hierarchy.

At present, we have the nodes set in cache mode. We will re-evaluate this after the
operating system is upgraded and has full support for flat mode; at that point we
may have some nodes in each mode to allow you to experiment.

There are more details about HBM and the modes at [Enabling High Bandwidth Memory for HPC and AI Applications for Next Gen Intel Xeon Processors](https://community.intel.com/t5/Blogs/Products-and-Solutions/HPC/Enabling-High-Bandwidth-Memory-for-HPC-and-AI-Applications-for/post/1335100).
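
If we do make flat mode nodes available, the simplest way to use the HBM without modifying code is NUMA binding. The sketch below assumes the HBM is exposed as separate, CPU-less NUMA nodes; the node numbers are illustrative, so check the actual layout on the node first. In cache mode none of this is needed.

```
# Show the NUMA layout of the node. In flat mode the HBM typically appears as
# additional NUMA nodes with memory but no CPUs, alongside the DDR nodes.
numactl --hardware

# Illustrative only: bind the application's memory to the assumed HBM NUMA
# nodes (2 and 3 here). Allocations fail if the HBM fills up; use --preferred
# for a softer policy.
numactl --membind=2,3 ./my_application
```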

### Requesting HBM nodes

You need to request these nodes explicitly in your job.

```
# Request HBM nodes
#$ -ac allow=W
```
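
Putting this together, a complete minimal jobscript for an HBM job might look like the following. This is a sketch with placeholder values, reusing the same Grid Engine requests as the other examples on this page and `gerun` as the MPI launcher (swap in your usual launcher if different).

```
#!/bin/bash -l
# Sketch: a small MPI job restricted to the HBM (W) nodes.
# Within the 48 hour wallclock limit for HBM CPU jobs.
#$ -l h_rt=24:00:00
# Memory per core.
#$ -l mem=4G
# 128 cores = two 64-core HBM nodes, inside the 2048-core limit.
#$ -pe mpi 128
# Request HBM nodes
#$ -ac allow=W
#$ -cwd
gerun ./my_mpi_program
```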


## Restricting to one node type

The scheduler will schedule your job on the relevant nodetype
based on the resources you request, but if you really need to specify
@@ -467,19 +501,23 @@ process per node which would create 80 threads on the node (on Hyperthreads).

## Diskless nodes

Young CPU nodes are diskless (have no local hard drives) - there is no `$TMPDIR`
available, so you should not request `-l tmpfs=10G` in your
Young standard and big memory CPU nodes are diskless (have no local hard drives) -
there is no `$TMPDIR` available, so you should not request `-l tmpfs=10G` in your
jobscripts or your job will be rejected at submit time.

If you need temporary space, you should use somewhere in your Scratch.

The GPU nodes and HBM nodes do have disks and so `tmpfs` can be requested there.
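
For example, a job that needs local scratch space on an HBM node can combine a `tmpfs` request with the node restriction described above (the size is a placeholder):

```
# Sketch: 50G of local disk, available to the job as $TMPDIR, on an HBM node.
#$ -ac allow=W
#$ -l tmpfs=50G
```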

## Disk quotas

You have one per-user quota, with a default amount of 250GB - this is the total across home and Scratch.
You have per-user quotas for home and Scratch.

- home: 100G quota, backed up, no increases available
- Scratch: 250G quota by default, not backed up, increases possible

- `lquota` shows you your quota and total usage (twice).
- `request_quota` is how you request a quota increase.
- `lquota` shows you your quota and total usage.
- `request_quota` is how you request a Scratch quota increase.
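
Both tools are run from a login node. A sketch of typical usage, assuming neither needs arguments; follow whatever instructions `request_quota` gives you:

```
# Show your home and Scratch quotas and current usage.
lquota

# Start a Scratch quota increase request (follow any prompts or instructions).
request_quota
```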

If you go over quota, you will no longer be able to create new files and your jobs will fail as they cannot write.
