Added HBM info for Young. Updated quota info for Young and Michael.
heatherkellyucl committed Oct 31, 2024
1 parent f4b0025 commit 985b8e8
Showing 2 changed files with 52 additions and 11 deletions.
5 changes: 4 additions & 1 deletion mkdocs-project-dir/docs/Clusters/Michael.md
@@ -360,7 +360,10 @@ If your job must run within a single CU, you can request the parallel environment

## Disk quotas

You have one per-user quota, with a default amount of 250GB - this is the total across home and Scratch.
You have per-user quotas for home and Scratch.

- home: 100G quota, backed up, no increases available
- Scratch: 250G quota by default, not backed up, increases possible

- `lquota` shows you your quota and total usage (twice).
- `request_quota` is how you request a quota increase.
58 changes: 48 additions & 10 deletions mkdocs-project-dir/docs/Clusters/Young.md
@@ -18,7 +18,7 @@ In particular the "Software Training" section:
details of hardware, how to submit jobs and an overview of types of parallelism
- [A quick 4 minute overview of how to choose memory](https://www.youtube.com/watch?v=pYspFuxbWjs)
- [Longer videos on memory allocation for new users](https://www.youtube.com/watch?v=D3iF_N0cwv4)
- [A hands-on memory allocation example](https://www.youtube.com/watch?v=QlMaN_ECIwg)

### MMM Hub: HPE / NVIDIA GPU Training Day

@@ -325,6 +325,7 @@ facilities.
| Free CPU job, any | 5120 | 0 | 24hrs |
| Free GPU job, any | 320 | 40 | 48hrs |
| Free GPU fast interactive | 64 | 8 | 6hrs |
| HBM CPU job, any | 2048 | 0 | 48hrs |

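As an illustration of how these limits translate into a job request, here is a minimal jobscript header for a free CPU job. This is a sketch only: the core count, memory and runtime are placeholder values, and it assumes the standard Grid Engine requests (`-l h_rt`, `-l mem`, `-pe mpi`) used on these clusters, with `gerun` as the MPI launcher (swap in your usual launcher if different).

```
#!/bin/bash -l
# Sketch of a free CPU job sized within the limits above (placeholder values).
# Wallclock: free CPU jobs can run for up to 24 hours.
#$ -l h_rt=24:00:00
# Memory per core.
#$ -l mem=4G
# 80 cores = two standard 40-core nodes, well inside the 5120-core limit.
#$ -pe mpi 80
# Run from the directory the job was submitted from.
#$ -cwd
gerun ./my_mpi_program
```
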
CPU jobs or [GPU jobs](#GPU_nodes) can be run on Young, and there are
different [nodes](#Node_types) dedicated for each.
@@ -362,15 +363,17 @@ If your job must run within a single CU, you can request the parallel environment

## Node types

Young has four types of node: standard nodes, big memory nodes, really big memory nodes
and GPU nodes. Note those last two have different processors and number of CPU cores per node.
Young has five types of node: standard nodes, big memory nodes, really big memory nodes,
GPU nodes and HBM nodes. Note that the last three have different processors and numbers of
CPU cores per node.

| Type | Cores per node | RAM per node | tmpfs | Nodes | Memory request necessary | GPU |
| ----- | -------------- | ------------ | ----- | ----- | ------------------------ | --- |
| C | 40 | 192G | None | 576 | Any | None |
| C | 40 | 192G (188G usable) | None | 576 | Any | None |
| Y | 40 | 1.5T | None | 3 | mpi: mem >=19G, smp: >186G total | None |
| Z | 36 | 3.0T | None | 3 | mpi: mem >=42G, smp: >1530G total | None |
| X | 64 | 1T | 200G | 6 | Any | 8 x Nvidia 40G A100 |
| W | 64 | (503G usable) | 3.5T | 32 | Any | None |

These are numbers of physical cores: multiply by two for virtual cores with
hyperthreading.
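
As an example of how the "Memory request necessary" column works, an MPI job whose per-core `-l mem` request is at or above the 19G threshold should be placed on the big memory (Y) nodes rather than the standard nodes. A sketch, with placeholder values:

```
# Sketch: a per-core memory request of 20G (>= the 19G threshold above) steers
# an MPI job onto the 1.5T big memory nodes instead of the 192G standard nodes.
#$ -pe mpi 80
#$ -l mem=20G
#$ -l h_rt=12:00:00
```
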
@@ -386,6 +389,7 @@ Here are the processors each node type has:
- Y: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
- Z: Intel(R) Xeon(R) Gold 6240M CPU @ 2.60GHz
- X: dual AMD EPYC 7543 32-Core Processor
- W: Intel(R) Xeon(R) Gold Max

(If you ever need to check this, you can include `cat /proc/cpuinfo` in your jobscript so
you get it in your job's .o file for the exact node your job ran on. You will get an entry
@@ -398,7 +402,37 @@ future.

[How to use the GPU nodes](../Supplementary/Young_GPU_Nodes.md).
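
For quick reference, a GPU job request on the X nodes looks roughly like the snippet below. Treat it as a sketch only: the `-l gpu=` count and the other values are illustrative, and the page linked above is the authoritative source for the supported options.

```
# Sketch only: request two of the A100 GPUs on a GPU (X) node.
# See the Young GPU nodes page linked above for the full set of options.
#$ -l gpu=2
#$ -l h_rt=12:00:00
```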

### Restricting to one node type

## High Bandwidth Memory nodes

The HBM nodes have 64GB of integrated High Bandwidth Memory per socket, with two sockets per node (128GB of HBM per node in total).

HBM nodes can be configured at the system level in one of two modes.

* HBM cache mode: In cache mode, HBM functions as a memory-side cache for contents of DDR memory.
In this mode, HBM is transparent to all software because the HBM cache is managed by hardware
memory controllers. No code changes are required.

* HBM flat mode: In flat mode, both DDR and the HBM address spaces are visible to software.
Applications may need to be modified or tuned to be aware of the additional memory hierarchy.

At present, we have the nodes set in cache mode. We will re-evaluate this after the
operating system is upgraded and has full support for flat mode; at that point we
may have some nodes in each mode to allow you to experiment.

There are more details about HBM and the modes at [Enabling High Bandwidth Memory for HPC and AI Applications for Next Gen Intel Xeon Processors](https://community.intel.com/t5/Blogs/Products-and-Solutions/HPC/Enabling-High-Bandwidth-Memory-for-HPC-and-AI-Applications-for/post/1335100).
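
If we do make flat mode nodes available, the simplest way to use the HBM without modifying code is NUMA binding. The sketch below assumes the HBM is exposed as separate, CPU-less NUMA nodes; the node numbers are illustrative, so check the actual layout on the node first. In cache mode none of this is needed.

```
# Show the NUMA layout of the node. In flat mode the HBM typically appears as
# additional NUMA nodes with memory but no CPUs, alongside the DDR nodes.
numactl --hardware

# Illustrative only: bind the application's memory to the assumed HBM NUMA
# nodes (2 and 3 here). Allocations fail if the HBM fills up; use --preferred
# for a softer policy.
numactl --membind=2,3 ./my_application
```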

### Requesting HBM nodes

You need to request these nodes explicitly in your job.

```
# Request HBM nodes
#$ -ac allow=W
```
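
Putting this together, a complete minimal jobscript for an HBM job might look like the following. This is a sketch with placeholder values, reusing the same Grid Engine requests as the other examples on this page and `gerun` as the MPI launcher (swap in your usual launcher if different).

```
#!/bin/bash -l
# Sketch: a small MPI job restricted to the HBM (W) nodes.
# Within the 48 hour wallclock limit for HBM CPU jobs.
#$ -l h_rt=24:00:00
# Memory per core.
#$ -l mem=4G
# 128 cores = two 64-core HBM nodes, inside the 2048-core limit.
#$ -pe mpi 128
# Request HBM nodes
#$ -ac allow=W
#$ -cwd
gerun ./my_mpi_program
```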


## Restricting to one node type

The scheduler will schedule your job on the relevant nodetype
based on the resources you request, but if you really need to specify
@@ -467,19 +501,23 @@ process per node which would create 80 threads on the node (on Hyperthreads).

## Diskless nodes

Young CPU nodes are diskless (have no local hard drives) - there is no `$TMPDIR`
available, so you should not request `-l tmpfs=10G` in your
Young standard and big memory CPU nodes are diskless (have no local hard drives) -
there is no `$TMPDIR` available, so you should not request `-l tmpfs=10G` in your
jobscripts or your job will be rejected at submit time.

If you need temporary space, you should use somewhere in your Scratch.

The GPU nodes and HBM nodes do have disks and so `tmpfs` can be requested there.
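
For example, a job that needs local scratch space on an HBM node can combine a `tmpfs` request with the node restriction described above (the size is a placeholder):

```
# Sketch: 50G of local disk, available to the job as $TMPDIR, on an HBM node.
#$ -ac allow=W
#$ -l tmpfs=50G
```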

## Disk quotas

You have one per-user quota, with a default amount of 250GB - this is the total across home and Scratch.
You have per-user quotas for home and Scratch.

- home: 100G quota, backed up, no increases available
- Scratch: 250G quota by default, not backed up, increases possible

- `lquota` shows you your quota and total usage (twice).
- `request_quota` is how you request a quota increase.
- `lquota` shows you your quota and total usage.
- `request_quota` is how you request a Scratch quota increase.
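
Both tools are run from a login node. A sketch of typical usage, assuming neither needs arguments; follow whatever instructions `request_quota` gives you:

```
# Show your home and Scratch quotas and current usage.
lquota

# Start a Scratch quota increase request (follow any prompts or instructions).
request_quota
```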

If you go over quota, you will no longer be able to create new files and your jobs will fail as they cannot write.
