lumi-CPEtools with support for hpcat. #160

Open
wants to merge 7 commits into
base: main
6 changes: 6 additions & 0 deletions easybuild/easyconfigs/l/lumi-CPEtools/LICENSE.md
@@ -5,3 +5,9 @@ and licensed under the GNU General Public License version 3.0, a copy of
which can be found in the
[LICENSE file in the source repository](https://github.com/Lumi-supercomputer/lumi-CPEtools/blob/main/LICENSE).

The `hpcat` tool included in the module is developed by HPE and licensed under an MIT-style
license which can be found in the
[LICENSE file in the source repository](https://github.com/HewlettPackard/hpcat/blob/main/LICENSE).

After loading the module, both license files are available in their respective subdirectories
in `$EBROOTLUMIMINCPETOOLS/share/licenses`.
39 changes: 39 additions & 0 deletions easybuild/easyconfigs/l/lumi-CPEtools/README.md
@@ -4,6 +4,8 @@ lumi-CPEtools is developed by the LUST team.

- [lumi-CPEtools on GitHub](https://github.com/Lumi-supercomputer/lumi-CPEtools)

- The [hpcat utility from HPE](https://github.com/HewlettPackard/hpcat) on GitHub


## EasyBuild

@@ -39,3 +41,40 @@ for the CrayEnv software stack.

- It looks like the compiler wrappers have changed in 24.03 as unloading the accelerator
target module in the cpeAMD version was no longer needed.


### Version 1.2

- Transformed the EasyConfig from version 1.1 to a Bundle to be able to add `hpcat`
using its own installation procedure.

- Building `hpcat`:

- LUMI lacks the `hwloc-devel` package, so we copied the header files from another system
and download them from LUMI-O during the build.

- The Makefile was modified to integrate better with EasyBuild and to work around a problem with
finding the `hwloc` library on LUMI.

Rather than writing a new Makefile or a patch, we use a number of `sed` commands to edit
the Makefile:

- `mpicc` was replaced with `$(CC)` so that the wrappers are used instead.
- `-O3` was replaced with `$(CFLAGS)` to pick up the options from EasyBuild.
- `-fopenmp` is still managed by the Makefile and not by EasyBuild: on the one hand because the
ultimate goal is to integrate with another package that sometimes needs and sometimes does not
need the OpenMP flags, and on the other hand so that `$(CFLAGS)` can also be used for `hipcc`.
- `-lhwloc` is replaced with `-Wl,/usr/lib64/libhwloc.so.15`. We had to do this through `-Wl` as
the `hipcc` driver thought this was a source file.
- As `-L.` is not needed, it is omitted.

- As there is no `make install`, we simply use the `MakeCp` EasyBlock, doing the edits to the Makefile in
`prebuildopts`.

Note that we copy the `hpcathip.so` file to the `lib` directory as that is the conventional
place to store shared objects, but it is not found there by `hpcat`, so we also create a
symbolic link to it in the `bin` subdirectory.

- Note that the accelerator target module should not be loaded when using the wrappers as the OpenMP offload
options cause a problem in one of the header files used.
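
The effect of the `sed` edits described above can be sketched in Python. This is only an illustration: the Makefile excerpt below is hypothetical (it is not the real `hpcat` Makefile), and plain string replacement is used as an approximation of `sed`, whose patterns are regular expressions (so the `.` in `-L. ` would match any character there).

```python
# Hypothetical Makefile excerpt, NOT taken from the hpcat sources.
SAMPLE_MAKEFILE = """\
hpcat: hpcat.c
\tmpicc -O3 -fopenmp -L. hpcat.c -o hpcat -lhwloc
"""

# (old, new) pairs in the order of the sed expressions in the easyconfig.
SUBSTITUTIONS = [
    ("-O3", "$(CFLAGS)"),
    ("mpicc", "$(CC)"),
    ("gcc", "$(CC)"),
    ("-lhwloc", "-Wl,/usr/lib64/libhwloc.so.15"),
    ("-L. ", ""),
]


def patch_makefile(text):
    """Apply each substitution once per line, like sed's s|old|new| without the g flag."""
    patched_lines = []
    for line in text.splitlines(keepends=True):
        for old, new in SUBSTITUTIONS:
            line = line.replace(old, new, 1)
        patched_lines.append(line)
    return "".join(patched_lines)


print(patch_makefile(SAMPLE_MAKEFILE))
```

With this sample input, the compile line ends up using `$(CC)` and `$(CFLAGS)`, keeps `-fopenmp`, and links the `hwloc` library through its full path.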

71 changes: 71 additions & 0 deletions easybuild/easyconfigs/l/lumi-CPEtools/USER.md
@@ -20,11 +20,78 @@ Commands provided:

- `gpu_check` (from version 1.1 on): A hybrid MPI/OpenMP program that prints information about thread and GPU binding/mapping on Cray EX Bardpeak nodes as in
LUMI-G, based on the ORNL hello_jobstep program. (AMD GPU nodes only)

- `hpcat` (from version 1.2 on): Another HPC Affinity Tracker program. This program
is [developed by HPE](https://github.com/HewlettPackard/hpcat) and shows for
each MPI rank the core(s) that will be used (and per thread if `OMP_NUM_THREADS`
is set), which GPU(s) are accessible to the task and which network adapter will be
used, indicating the NUMA domain for each so that one can easily check if the resource
mapping is ideal.

The various `*_check` programs are designed to test CPU and GPU binding in Slurm and
are the LUST recommended way to experiment with those bindings.


## Some interactive examples

The examples assume the appropriate software stack modules and the `lumi-CPEtools` module
are loaded. The examples use one particular version of the modules but also work with others.
You'll also need to add the appropriate `-A` flag to the `salloc` commands.


### `gpu_check`

```
salloc -N2 -pstandard-g -G 16 -t 10:00
module load LUMI/24.03 partition/G lumi-CPEtools/1.2-cpeGNU-24.03
srun -n16 -c7 bash -c 'ROCR_VISIBLE_DEVICES=$SLURM_LOCALID gpu_check -l'
srun -n16 -c7 \
--cpu-bind=mask_cpu:0xfe000000000000,0xfe00000000000000,0xfe0000,0xfe000000,0xfe,0xfe00,0xfe00000000,0xfe0000000000 \
bash -c 'ROCR_VISIBLE_DEVICES=$SLURM_LOCALID gpu_check -l'
```

Note that in the first `srun` command, the GPU binding is not optimal while
in the second `srun` command it is.
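
To see which cores each entry of the `--cpu-bind=mask_cpu:` list selects, the hexadecimal masks can be decoded bit by bit. The following is a minimal sketch; the helper name `cores_in_mask` is made up for this illustration, and it assumes that bit *n* of a mask corresponds to core *n*, as in the Slurm CPU mask format.

```python
def cores_in_mask(mask):
    """Return the sorted list of core numbers whose bit is set in a hex CPU mask."""
    value = int(mask, 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


# The mask list from the second srun command; entry i is used by local rank i.
MASKS = "0xfe000000000000,0xfe00000000000000,0xfe0000,0xfe000000,0xfe,0xfe00,0xfe00000000,0xfe0000000000"

for local_rank, mask in enumerate(MASKS.split(",")):
    cores = cores_in_mask(mask)
    print(f"local rank {local_rank}: cores {cores[0]}-{cores[-1]} ({len(cores)} cores)")
```

Each mask selects seven consecutive cores and skips the first core of its group of eight. The order of the masks determines which group of cores each local rank, and hence each GPU, is paired with.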


### `hpcat` on a GPU node

```
salloc -N2 -pstandard-g -G 16 -t 10:00
module load LUMI/24.03 partition/G lumi-CPEtools/1.2-cpeGNU-24.03
srun -n16 -c7 bash -c 'ROCR_VISIBLE_DEVICES=$SLURM_LOCALID OMP_NUM_THREADS=7 hpcat'
srun -n16 -c7 \
--cpu-bind=mask_cpu:0xfe000000000000,0xfe00000000000000,0xfe0000,0xfe000000,0xfe,0xfe00,0xfe00000000,0xfe0000000000 \
bash -c 'ROCR_VISIBLE_DEVICES=$SLURM_LOCALID OMP_NUM_THREADS=7 hpcat'
srun -n16 -c7 \
--cpu-bind=mask_cpu:0xfe000000000000,0xfe00000000000000,0xfe0000,0xfe000000,0xfe,0xfe00,0xfe00000000,0xfe0000000000 \
bash -c 'ROCR_VISIBLE_DEVICES=$SLURM_LOCALID OMP_NUM_THREADS=7 OMP_PLACES=cores hpcat'
```

Note that in the first `srun` command, the mapping of resources is not very good. GPUs
don't map to their closest chiplet, and the network adapters are also linked based
on the CPU NUMA domain. In the second case, the mapping is optimal, but except with the
Cray compilers, the OpenMP threads can still move within the chiplet. In the last case, the
threads are also pinned with all compilers.


### `serial_check`, `omp_check`, `mpi_check` and `hybrid_check`

```
salloc -N1 -pstandard -t 10:00
module load LUMI/24.03 partition/C lumi-CPEtools/1.2-cpeGNU-24.03
srun -n1 -c1 serial_check
srun -n1 -c4 omp_check
srun -n4 -c1 mpi_check
srun -n4 -c4 hybrid_check
```

One big difference between these tools and `hpcat` is that these tools show on which
core a thread is running at the moment of the measurement, while `hpcat` shows
the affinity mask, i.e., all cores that can be used by that thread. `gpu_check`
has the same limitation as `omp_check` etc.
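
The difference between the two views can be illustrated with a small Linux-only Python sketch. It is not how the tools themselves are implemented; it simply assumes a Linux system where `os.sched_getaffinity` is available and where field 39 of `/proc/self/stat` holds the core the process last ran on.

```python
import os


def allowed_cpus():
    """Affinity mask of this process: every core it is allowed to run on."""
    return sorted(os.sched_getaffinity(0))


def current_cpu():
    """Core this process last ran on, read from field 39 of /proc/self/stat."""
    with open("/proc/self/stat") as stat_file:
        stat = stat_file.read()
    # The command name (field 2) may contain spaces, so split after its closing ')'.
    fields_after_comm = stat.rsplit(")", 1)[1].split()
    # The list starts at field 3, so field 39 sits at index 36.
    return int(fields_after_comm[36])


print("affinity mask (the hpcat-style view):", allowed_cpus())
print("current core (the *_check-style view):", current_cpu())
```

Run under `srun` with different `--cpu-bind` settings, the first line changes with the binding, while the second line can vary between runs as long as the mask contains more than one core.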


## Acknowledgements

The code for `hybrid_check` and its offspring `serial_check`, `omp_check` and `mpi_check` is inspired
@@ -39,3 +106,7 @@ without reworking on other AMD GPU systems or on NVIDIA GPU systems.

The `lumi-CPEtools` code is developed by LUST in the
[lumi-CPEtools repository on the LUMI supercomputer GitHub](https://github.com/Lumi-supercomputer/lumi-CPEtools).

The `hpcat` program (lumi-CPEtools 1.2 and later) is
[developed by HPE](https://github.com/HewlettPackard/hpcat) and provided
unmodified.
@@ -0,0 +1,145 @@
easyblock = 'Bundle'

local_CPEtools_version = '1.2'
local_hpcat_version = '0.4'

name = 'lumi-CPEtools'
version = local_CPEtools_version

homepage = 'https://www.lumi-supercomputer.eu'

whatis = [
'Description: Various programs to experiment with starting processes and core affinity and analyse executables.',
]

description = """
The LUMI-CPEtools module provides various programs to experiment with starting
applications of various types and with core affinity and to show which Cray PE
libraries are used by an executable. It may be enhanced with additional
features in the future.

Sources for all but hpcat can be accessed after loading the module in the
directory $EBROOTLUMIMINCPETOOLS/src.
"""

docurls = [
"Man pages, start with man lumi-CPEtools",
]

toolchain = {'name': 'cpeAMD', 'version': '24.03'}
# Note: The Makefile is designed to work with the compiler variables as defined
# when usempi and openmp are both false, as the module contains code with and
# without MPI or OpenMP support, though it would probably still work if these
# are set to true as usually it does no harm to compile with the MPI wrappers
# or OpenMP options enabled even if the sources don't use these.
toolchainopts = {'usempi': False, 'openmp': False, 'extra_cxxflags': '-std=c++11'}

builddependencies = [
('buildtools', '%(toolchain_version)s', '', SYSTEM), # For make
]

import os as local_os
local_partition = local_os.getenv('LUMI_STACK_PARTITION')

start_dir = 'src'

#
# Options for lumi-CPEtools
#
local_CPEtools_buildopts = 'CC=cc MPICC=cc CFLAGS="-O2" OMPFLAG="-fopenmp" CXX=CC MPICXX=CC CXXFLAGS="-O2 -std=c++11" '
local_CPEtools_buildopts += 'DEFINES="-D_GNU_SOURCE=" ROCMDEFINES="-D__HIP_PLATFORM_AMD__" '
local_CPEtools_buildopts += 'LIBS="" ROCMLIBS="-L${ROCM_PATH}/lib -lamdhip64" '

if local_partition == 'G':
    local_CPEtools_buildopts += 'exe_cpu exe_gpu'
else:
    local_CPEtools_buildopts += 'exe_cpu'

#
# Options for hpcat
#
local_hpcat_prebuildopts = 'sed -i -e \'s|-O3|$(CFLAGS)|\' -e \'s|mpicc|$(CC)|\' -e \'s|gcc|$(CC)|\' Makefile && '
local_hpcat_prebuildopts += 'sed -i -e \'s|-lhwloc|-Wl,/usr/lib64/libhwloc.so.15|\' -e \'s|-L. ||\' Makefile && '
if local_partition == 'G':
    local_hpcat_prebuildopts += 'module unload craype-accel-amd-gfx90a && '

if local_partition == 'G':
    local_hpcat_buildopts = 'all amd'
else:
    local_hpcat_buildopts = 'all'

local_hpcat_files_to_copy = [ (['hpcat'], 'bin'), (['LICENSE'], 'share/licenses/hpcat') ]
if local_partition == 'G':
    local_hpcat_files_to_copy += [ (['hpcathip.so'], 'lib') ]


default_easyblock = 'MakeCp'

components = [
('lumi-CPEtools', local_CPEtools_version, {
'sources': [{ # https://github.com/Lumi-supercomputer/lumi-CPEtools/archive/refs/tags/1.2.tar.gz
'download_filename': '%(version)s.tar.gz',
'filename': 'lumi-CPEtools-%(version)s.tar.gz',
'source_urls': ['https://github.com/Lumi-supercomputer/%(name)s/archive/refs/tags']
}],
'checksums': [{ f'lumi-CPEtools-{version}.tar.gz': '89e1a01d9ecd30da53c70d866aa4a2d60853e16b2cc2da4b23589742f8d84e86' }],
'start_dir': '%(name)s-%(version)s',
'prebuildopts': 'cd src && unset LIBRARY_PATH && ', # Cannot do the directory change via start_dir as otherwise the bin etc directories are not found in the copy step.
'buildopts': local_CPEtools_buildopts,
'files_to_copy': [ 'bin', 'man', 'src', 'README.md', (['LICENSE'], 'share/licenses/lumi-CPEtools') ],
}),
( 'hpcat', local_hpcat_version, {
'sources': [{ # HPCAT 0.4: https://github.com/HewlettPackard/hpcat/archive/refs/tags/v0.4.tar.gz
'download_filename': 'v%(version)s.tar.gz',
'filename': '%(name)s-%(version)s.tar.gz',
'source_urls': ['https://github.com/HewlettPackard/hpcat/archive/refs/tags']
},{ # Missing header files for hwloc from https://lumidata.eu/462000008:missing-libraries-headers/hwloc-devel-15SP5.tar.bz2
'filename': 'hwloc-devel-15SP5.tar.bz2',
'source_urls': ['https://lumidata.eu/462000008:missing-libraries-headers'],
'extract_cmd': 'cd %(name)s-%(version)s && tar xvf %s',
}],
'checksums': [
{ f'hpcat-{local_hpcat_version}.tar.gz': 'dfad8649a5cc75c07deabbd5682b22fe0fdd650de14b382cccc5244a27b439ab' },
{ 'hwloc-devel-15SP5.tar.bz2': '3ca23fcaca9ba05e44e6816076230f356e658648b7e1747241f4bd3632011582' }
],
'start_dir': '%(name)s-%(version)s',
'prebuildopts': local_hpcat_prebuildopts,
'buildopts': local_hpcat_buildopts,
'files_to_copy': local_hpcat_files_to_copy,
})
]

if local_partition == 'G':
    # Postinstall for hpcat
    postinstallcmds = [ 'cd %(installdir)s/bin && ln -s %(installdir)s/lib/hpcathip.so' ]

#
# Sanity checks
#

# Executables (and, on the G partition, the symbolic link to the hpcat HIP plugin) expected in bin.
local_exe = [ 'serial_check', 'omp_check', 'mpi_check', 'hybrid_check', 'xldd', 'hpcat' ]
if local_partition == 'G':
    local_exe += [ 'gpu_check', 'hpcathip.so' ]

local_files = [ 'share/licenses/lumi-CPEtools/LICENSE', 'share/licenses/hpcat/LICENSE' ]
if local_partition == 'G':
    local_files += [ 'lib/hpcathip.so' ]

sanity_check_paths = {
    'files': [ 'bin/%s' % x for x in local_exe ] + local_files,
    'dirs':  [ 'man/man1', 'src' ]
}

sanity_check_commands = [
'xldd --help',
'serial_check',
'OMP_NUM_THREADS=4 omp_check',
'mpi_check',
'OMP_NUM_THREADS=4 hybrid_check',
'hpcat --version',
]
if local_partition == 'G':
    sanity_check_commands += [ 'gpu_check -h' ]

moduleclass = 'devel'