OpenACC + Cray CCE + AMD MI200+ #368


Merged
merged 121 commits into from
Apr 6, 2024


anandrdbz
Contributor

@anandrdbz anandrdbz commented Mar 8, 2024

Description

Adds support for MI200+ GPUs via CCE compilers and OpenACC.

Type of change

Please delete options that are not relevant.

  • New feature (non-breaking change which adds functionality)

Scope

  • This PR comprises a set of related changes with a common goal

Closes #352 #383 #384

Test Configuration:

  • What computers and compilers did you use to test this:
    • OLCF Frontier

henryleberre and others added 30 commits May 8, 2023 13:38
Cray's gray skies

getting somewhere...

the leaves are brown and the sky is cray

cray is driving us cray cray

not waitin' t'ill we're old and cray

fixing some problems cray-ted by cray

cray-ving simpler times

cray-shing and burning

we're too cray-tive

cray-ing in pain

cray-ving PTO

hmm
There are multiple facets to these changes, which fix CCE's handling of allocatable
module arrays:

1) CCE only has a problem with allocatable module arrays, not scalars or
statically sized arrays. Use declare create for everything that isn't allocatable.
2) Allocatable array handling is broken in CCE <= 16.0.0, but an effective
(but ugly) workaround is to add a pointer in front of the allocatable
and leverage the pointer attachment logic to fix up the link.
The CRAY_DECLARE_GLOBAL macro declares the shadow allocatable and the pointer.
The ALLOCATE_GLOBAL macro allocates the shadow, attaches the pointer,
and creates/attaches each on the device.
The DEALLOCATE_GLOBAL macro detaches and releases the device entries,
nullifies the pointer, and deallocates the shadow.
The ALLOCATE and DEALLOCATE macros can be used for local variables or
derived type components.

This commit still isn't functional on AMD GPUs due to some register allocation
issues in a specific function. Those will be addressed in a future commit.
Loop bound variables don't need to be mapped.
G2 will do a debug build without disabling loop collapsing.
Too much state+too many forks can hit a compiler bug
around SGPR spillage.
This may not be needed, and will probably be removed.
It's possible that CCE is not mangling the private global names on the device
in a way that makes them unique, which could be causing problems.

Most of these don't need to be declared to the device, though.
They're loop bounds which shouldn't need to be mapped.

Need to check if these WARs are actually needed.
There's a possible CCE issue with subarrays on the device.
Not clear if it's a real problem or just a dev compiler build problem.
By default CCE fortran will try to make kernels async,
and do some addressing tweaks.
These can sometimes cause problems, so turning them off
is almost always a good debugging step.
This is related to some screwy variable name mangling, and might not be necessary.
This was causing a lot of build failures.
This seems to be an endemic change, we may need to roll more back
@sbryngelson
Member

sbryngelson commented Apr 5, 2024

Change ./mfc.sh load compute name from Crusher to Frontier

Update: Did this myself in 110a290

@sbryngelson
Member

sbryngelson commented Apr 5, 2024

./mfc.sh test -a -- -c frontier does not work.

Specifically:

FileNotFoundError: [Errno 2] No such file or directory: '/lustre/orion/cfd154/scratch/sbryngelson/MFC/build/install/dependencies/bin/h5dump'

and

sbryngelson/scratch $ ls MFC/build/install/dependencies/bin/
hipfc

@sbryngelson
Member

sbryngelson commented Apr 5, 2024

It's looking like Frontier CI may fail for the 2-rank case. Tests were run with

./mfc.sh test -j 8 -- -c frontier

The test MFC.sh file in the 2-rank directory reads

(set -x; srun -N 1 -n 2 "/lustre/orion/cfd154/scratch/sbryngelson/runner/actions-runner/_work/MFC/MFC/build/install/0571538fd2/bin/simulation")

which appears to be the problem. It should be passing --ntasks-per-node (or an equivalent option), since we are using -- -c frontier.

Update: It passed on second try 🤷

@sbryngelson
Member

@henryleberre, do you know why it doesn't build h5dump? (or at least it isn't found in the expected bin/ directory)

@henryleberre
Member

henryleberre commented Apr 5, 2024

@sbryngelson We opted not to build HDF5 on CCE. I forget why, perhaps there were some incompatibilities. We use the cray-hdf5 module so h5dump should already be available.

@sbryngelson
Member

sbryngelson commented Apr 5, 2024

@henryleberre you are correct, h5dump is already in the path. It looks like the problem is that using test -a forces it to look in dependencies/bin/h5dump for the binary (rather than the path broadly). Is there a fix for this?

Here:
./mfc/test/test.py: h5dump = f"{HDF5.get_install_dirpath()}/bin/h5dump"

It does look like we have this option:

            if ARG("no_hdf5"):
                if not does_command_exist("h5dump"):
                    raise MFCException("--no-hdf5 was specified and h5dump couldn't be found.")

                h5dump = shutil.which("h5dump")

though it doesn't seem to be working like this
./mfc.sh test -a j 1 -- -c frontier --no-hdf5
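
The lookup problem described here can be sketched as follows. This is a minimal illustration and not the fix that was merged: `resolve_h5dump` and its `install_dir` parameter are hypothetical names, and the only library calls assumed are Python's standard `os` and `shutil.which`. The sketch prefers a locally built `h5dump` when one exists and otherwise falls back to whatever is on `PATH`, such as the binary provided by the `cray-hdf5` module.

```python
import os
import shutil
from typing import Optional


def resolve_h5dump(install_dir: Optional[str] = None) -> str:
    """Return a path to h5dump (hypothetical helper, not MFC's actual code)."""
    # 1) Prefer a binary built into the given install directory, if present.
    if install_dir is not None:
        candidate = os.path.join(install_dir, "bin", "h5dump")
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate

    # 2) Otherwise fall back to PATH (e.g. a system or module-provided h5dump).
    found = shutil.which("h5dump")
    if found is None:
        raise FileNotFoundError(
            "h5dump was not found in the install directory or on PATH."
        )
    return found
```

With a fallback of this kind, a missing `dependencies/bin/h5dump` would only be an error when no `h5dump` is available anywhere, rather than whenever HDF5 was not built locally.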

@henryleberre
Member

henryleberre commented Apr 5, 2024

@sbryngelson I'm testing a fix. For your command, you would have to use this instead:

$ ./mfc.sh test -a --no-hdf5 -- -c frontier
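
The difference between the two commands can be illustrated with a small sketch. It assumes, based only on the behaviour reported in this thread, that options before the bare `--` are parsed by the `test` command while everything after it is handed on to a later stage. `split_at_double_dash` is a hypothetical helper and not part of `mfc.sh`.

```python
from typing import List, Tuple


def split_at_double_dash(argv: List[str]) -> Tuple[List[str], List[str]]:
    """Split argv at the first bare "--" (hypothetical helper for illustration)."""
    if "--" not in argv:
        return list(argv), []
    i = argv.index("--")
    # Everything before "--" stays with the command; the rest is forwarded.
    return argv[:i], argv[i + 1:]


own, forwarded = split_at_double_dash(
    ["test", "-a", "--no-hdf5", "--", "-c", "frontier"]
)
print(own)        # arguments seen by the test command itself
print(forwarded)  # arguments passed along after "--"
```

Under this assumption, a `--no-hdf5` placed after `--` would land in the forwarded list and never reach the code that checks for it, which is consistent with the earlier command not working.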

@sbryngelson
Member

@henryleberre this works!

@sbryngelson
Member

closes #352 #383 #384

@sbryngelson sbryngelson merged commit 2b3d35d into MFlowCode:master Apr 6, 2024
Labels
None yet
5 participants