
HIP compilation #135

Open · wants to merge 26 commits into master

Conversation

@ryanstocks00

Similar to wavefunction91/ExchCXX#39, I had to make these minor changes to get GauXC to compile with HIP:

```cpp
std::min(uint64_t(16), util::div_ceil( nbf_max, 16 )),
std::min(uint64_t(GGA_KERNEL_SM_BLOCK_Y), util::div_ceil( npts_max, GGA_KERNEL_SM_BLOCK_Y )),
```
@ryanstocks00 (Author) commented Jul 18, 2024

I think the nbf_max usage here was potentially a bug? I have replaced it with npts_max.
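
For context, a minimal sketch of the computation under discussion; util::div_ceil and GGA_KERNEL_SM_BLOCK_Y are taken from the snippet above, while the helper definition and the constant's value are assumptions:

```cpp
// Minimal sketch of the launch-dimension change under discussion; the
// div_ceil definition and the GGA_KERNEL_SM_BLOCK_Y value are assumptions.
#include <algorithm>
#include <cstdint>

namespace util {
  inline uint64_t div_ceil(uint64_t a, uint64_t b) { return (a + b - 1) / b; }
}

constexpr uint64_t GGA_KERNEL_SM_BLOCK_Y = 16; // assumed value

// The kernel's Y axis iterates over grid points, so the clamp should use
// npts_max (this PR) rather than nbf_max (the original code).
inline uint64_t gga_block_y(uint64_t npts_max) {
  return std::min(GGA_KERNEL_SM_BLOCK_Y,
                  util::div_ceil(npts_max, GGA_KERNEL_SM_BLOCK_Y));
}
```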

@wavefunction91 (Owner)

It's not a bug, but potentially not optimal on some hardware. What have you been testing on? In principle, these parameters should be tuned, these were just the ones that were found to perform best on V100/A100.

@ryanstocks00 (Author)

Is it guaranteed that npts_max is greater than nbf_max? My understanding from a perusal of the function is that the y axis iterates over the points rather than the basis functions, so the nbf_max could have been problematic. I'm still working on testing it all out, so I certainly haven't got as far as performance tuning yet.

@wavefunction91 (Owner)

It's not guaranteed, although it usually is. I agree, what's there is likely a typo, but the kernel is hardened to take any block/grid dimension and still give the right results (i.e. whether or not loops get executed is based on the number of warps in the thread block).

I'll check to see whether this kneecaps the performance of this kernel in prod. If there's no change or it's better, I'll accept it for being "correct"; if it's worse, I'll come back with a hand-wavy/tin-foil-hat reason for why that's the case :).
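
To illustrate what "hardened" means here, a sketch (not the actual GauXC kernel) of a loop whose correctness doesn't depend on the launch shape:

```cpp
// Illustrative sketch, not the actual GauXC kernel: trip counts are derived
// from the runtime block shape, so any block/grid dimensions partition the
// work differently but still touch every element exactly once.
__global__ void hardened_kernel(const double* in, double* out, int npts) {
  const int nthreads = blockDim.x * blockDim.y;          // threads per block
  const int tid      = threadIdx.y * blockDim.x + threadIdx.x;
  for (int i = blockIdx.x * nthreads + tid; i < npts;
       i += gridDim.x * nthreads) {                      // grid-stride loop
    out[i] = 2.0 * in[i];
  }
}
```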

@wavefunction91 (Owner) left a comment

This is a bit more complicated than wavefunction91/ExchCXX#39. I'm happy to work with you on getting this fixed up, but the HIP implementation in GauXC is a bit delicate (or at least it was for older HIP/ROCm on MI250/300 a few years ago when this was a priority): the kernels are prone to silent errors due to suboptimal resource usage emitted by the optimizing compilers (i.e. the kernel is too large for the launch parameters, and we never really hardened the implementation to do the "right" thing). Getting e.g. the uvvars kernels to work was quite a bit of effort, IIRC.

I can provide you with some of our canonical stress test cases to validate the implementation - again, I don't have access to AMD hardware at the moment, so I can't really test anything on my end.


@ajaypanyala (Contributor) commented Jul 19, 2024

@wavefunction91 @ryanstocks00 After a couple of minor tweaks to the build to enable successful hipblas discovery, I was able to build the code on Frontier. I made sure this PR points to @ryanstocks00's fork of ExchCXX. However, when I run a test, I get:

```
  what():  Generic GauXC Exception (EXX + non-CUDA NYI)
  File     /scratch/panyala/bdft/GauXC_External-prefix/src/GauXC_External/src/xc_integrator/local_work_driver/device/scheme1_base.cxx
  Function virtual void GauXC::AoSScheme1Base::eval_exx_ek_screening_bfn_stats(XCDeviceData *)
  Line     1389
```

@wavefunction91 (Owner)

Thanks Ajay, yeah, we don't have a HIP implementation of snK (although it might be easy to add; we would just need to tune the kernel params, since it's linear along the wavefront), so we just need to add the logic to not run that test.

Can you provide the diff for the build system stuff?
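
(The guard might look something like the following sketch; the macro name GAUXC_HAS_CUDA and the test body are assumptions, not GauXC's actual API.)

```cpp
// Hypothetical sketch of "the logic to not run that test"; GAUXC_HAS_CUDA
// and the test body are assumed names, not GauXC's actual configuration.
#include <iostream>

void maybe_run_snK_test() {
#ifdef GAUXC_HAS_CUDA
  // Exercise the EXX/snK screening path (currently CUDA-only).
  std::cout << "running snK test\n";
#else
  // Skip on HIP and other backends rather than tripping the
  // "EXX + non-CUDA NYI" exception shown above.
  std::cout << "skipping snK test (non-CUDA backend)\n";
#endif
}
```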

@ajaypanyala (Contributor)

No worries, I realized that just now regarding sn-K. The regular XC eval works fine. Can I commit the build system changes to this PR?

@wavefunction91 (Owner)

That would be up to @ryanstocks00.

@ryanstocks00 (Author)

> This is a bit more complicated than wavefunction91/ExchCXX#39. I'm happy to work with you on getting this fixed up, but the HIP implementation in GauXC is a bit delicate (or at least it was for older HIP/ROCm on MI250/300 a few years ago when this was a priority): the kernels are prone to silent errors due to suboptimal resource usage emitted by the optimizing compilers. Getting e.g. the uvvars kernels to work was quite a bit of effort, IIRC.
>
> I can provide you with some of our canonical stress test cases to validate the implementation - again, I don't have access to AMD hardware at the moment, so I can't really test anything on my end.

Yes, we have had similar issues on the MI250 hardware. My main priority here was getting something that would compile so as not to break our HIP CI pipeline. I haven't yet gone through thorough testing, as I was going to get it integrated on NVIDIA systems first. That said, we do have access to AMD hardware, so I am happy to do whatever testing you think will be useful. I'm hopeful there will be more useful contributions going forward as I wrap my head around it all.

@ajaypanyala happy for you to commit changes to this PR - do you have the required permissions to push to the branch in my repo?

@ajaypanyala (Contributor) commented Jul 20, 2024

> @ajaypanyala happy for you to commit changes to this PR - do you have the required permissions to push to the branch in my repo?

@ryanstocks00 I do not have the permissions. Could you please add me ?

@wavefunction91 (Owner)

@ajaypanyala How are you testing this? Just the UTs or are you running something non-trivial (e.g. Ubi)? I'd like to get a full run of Ubi/DZ on AMD HW before merging this.

Also, we might want to wait until #91 is merged, as it will require additional updates to work with HIP; there are potentially some issues to work out with runtime errors for large systems as well. We might need to coordinate with @mikovtun to get a big test system (they have some big things they're testing with).

@ajaypanyala (Contributor)

@wavefunction91 Tested with Ubi/DZ (pbe0) on MI250X.

@ajaypanyala (Contributor)

@wavefunction91 Is this ready to go (modulo the merge conflict)?

@ryanstocks00 (Author)

@wavefunction91 @ajaypanyala I have fixed the merge conflicts so that it successfully compiles with HIP; however, it fails a lot of tests. It would be great to get some more experienced eyes over it.

@wavefunction91 (Owner)

Hi @ryanstocks00, sorry for the delay. I've recently changed jobs, so I'm still in the process of renormalizing where my time is spent.

Great that this compiles. Could you provide more info on which tests are failing? That will help us pinpoint where things could be going wrong. It might also be worth running the standalone_driver through rocgdb to see if we can see which kernels are dying. If I recall correctly, the last time I worked with AMD/HIP the issue was invariably silent kernel launch failures due to hard-coded launch params (optimized for A100).

@ryanstocks00 (Author)

@wavefunction91 hope Microsoft is treating you well! (Is this project likely to continue much development?)

The HIP code seems to be correct for LDA functionals, so I think I must have stuffed something in the GGA code during the merge (I'm not sure the MGGA code was ever fully implemented in HIP). I attempted to run rocgdb on standalone_driver as suggested using an MI250, however I didn't get any errors. Is this the right process? (hip_mi250_standalone_b3lyp_rocgdb.txt) The output is non-deterministic for both EXC and VXC.

I have attached the output from running ./gauxc_test on both NVIDIA and AMD (interestingly, there is some small numeric noise on the 4080 that causes some of the tests to fail, but I assume this is just because it is not a datacenter GPU):

cuda_4080s_standalone_b3lyp.txt
cuda_4080s_test.txt
hip_mi250_standalone.txt
hip_mi250_standalone_b3lyp.txt
hip_mi250_test.txt

@wavefunction91 (Owner)

@ryanstocks00 Thanks, this is helpful for diagnosing the problem (I think). Lots to unpack here; I'll try to cover everything.

> Is this project likely to continue much development?

Yes. Updates may be rolled out more slowly, but dev and support will continue.

> The HIP code seems to be correct for LDA functionals so I think I must have stuffed something in the GGA code during the merge

Nothing obvious from what I can see in the updates, but I agree with this assessment based on the results you've shared.

> I attempted to run rocgdb on standalone_driver as suggested using an MI250 however didn't get any errors

This should indicate that kernels are not failing to launch (although I'm not precluding it; rocgdb is not the most stable software on earth).

> is this the right process?

Looks like it.

> The output is non-deterministic for both EXC and VXC.

Can you expand on this? Both for LDA and GGA?

> 4080

Yes, this is known (#134). I'm not completely sold on the use of consumer GPUs for high-performance DFT simulations (happy to be proven wrong!), so I don't think I'll be prioritizing that for some time (that, and I don't have a consumer-grade GPU to test on!).

Based on the results you've shared, I think the next thing to check is whether or not the LDA gradients work on AMD. If they do, we can at least preclude the collocation gradients being the problem (as well as get a more extensive testing of the batched BLAS, since they'll run through similar paths). If those work, then it's in one of two kernels: the uvvars or the zmat. Since EXC looks busted for GGAs, I'd suspect (at least) the former is the culprit. When I was designing that kernel, it was definitely tuned for the NVIDIA SM structure (e.g. warp len of 32 + square process grids within the block). I'm not sure about the behaviour of e.g. the warp-level reductions on AMD and whether or not there's a non-obvious race condition. This should also be a problem for LDA (i.e. if the logic is fixed there, it should be fixed everywhere), but it might be worth another look.
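
For reference, the warp-length concern in a sketch (illustrative, not GauXC's actual reduction): a shuffle reduction hard-coded to a width of 32 silently ignores half of each 64-lane AMD wavefront, whereas deriving the width per platform keeps the reduction complete.

```cpp
// Sketch of a wavefront-size-aware reduction (not GauXC code). NVIDIA warps
// are 32 lanes; AMD CDNA wavefronts (e.g. MI250) are 64. A reduction
// hard-coded to 32 would silently drop half of each AMD wavefront.
__device__ double wave_reduce_sum(double x) {
#ifdef __HIP_PLATFORM_AMD__
  constexpr int width = 64;                       // AMD wavefront length
  for (int off = width / 2; off > 0; off /= 2)
    x += __shfl_down(x, off, width);              // HIP shuffle intrinsic
#else
  constexpr int width = 32;                       // NVIDIA warp length
  for (int off = width / 2; off > 0; off /= 2)
    x += __shfl_down_sync(0xffffffffu, x, off);   // CUDA sync shuffle
#endif
  return x;
}
```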

Again, I don't have access to AMD hardware at the moment, so there's not a ton I can do on the debugging side. Happy to brainstorm debugging ideas though, let me know.

P.S. I'm sure you're aware of this, but just to be explicit - running the standalone driver with b3lyp for benzene will fail the checks as only the SVWN5 data is saved there. I'm assuming you ran those to get the reference numbers for B3LYP?

@ryanstocks00 (Author)

@wavefunction91 thank you very much for the detailed response. I don't think we can check the grads on AMD as I get a "Generic GauXC Exception (LDA Grad NYI for HIP Backends)".

> The output is non-deterministic for both EXC and VXC. Can you expand on this? Both for LDA and GGA?

Using the standalone driver, EXC and VXC are both correct for LDA and both non-deterministic for GGA (and an order of magnitude different to the correct value calculated with the host CPU implementation).

I have finally managed to get it to compile and run with HIP on an NVIDIA machine, and I get much the same results as on AMD (correct LDA, non-deterministic GGA), which I think rules out AMD launch configuration issues and probably warp length issues, so I'm pretty sure it is a code issue. This could potentially be a way for you to test as well, since the issue is reproducible on NVIDIA hardware. If there's a good way to narrow down which kernel is the issue, that would be very helpful (e.g. does the fact that the "Li / SVWN5 / sto-3g" tests in gauxc_test fail despite being LDA mean anything?).

@ryanstocks00 (Author)

I think I have now got GauXC running correctly with HIP on an NVIDIA machine: the problem was the vvar grad kernel, which required the thread block to be square. I think this will still be problematic on AMD machines, as there isn't sufficient shared memory with the increased warp length, so it will need some more modification. Though, I'm going to have to put this on the backburner for a bit due to other priorities.
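
To make the shared-memory concern concrete (a sketch under assumed tile sizes, not the actual vvar grad kernel): a square tile whose edge tracks the warp/wavefront length quadruples its shared-memory footprint when moving from 32 to 64 lanes.

```cpp
// Illustrative sketch, not the actual vvar grad kernel. A square tile sized
// to the wave length needs 32*32*8 B = 8 KiB of shared memory on NVIDIA but
// 64*64*8 B = 32 KiB on AMD, which quickly exhausts LDS once a kernel holds
// several such buffers.
#ifdef __HIP_PLATFORM_AMD__
constexpr int TILE = 64;   // CDNA wavefront length
#else
constexpr int TILE = 32;   // NVIDIA warp length
#endif

__global__ void square_tile_transpose(const double* in, double* out) {
  // Assumes a square launch: blockDim.x == blockDim.y == TILE,
  // with one block covering a TILE x TILE patch of the input.
  __shared__ double tile[TILE][TILE];
  tile[threadIdx.y][threadIdx.x] = in[threadIdx.y * TILE + threadIdx.x];
  __syncthreads();
  out[threadIdx.y * TILE + threadIdx.x] = tile[threadIdx.x][threadIdx.y];
}
```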

@wavefunction91 (Owner)

> vvar_grad

For the GGA energy or for the gradient? You should only call the vvar kernel for non-gradients.

Admittedly, I hadn't fully appreciated how complicated that code became with the latest refactor to support GKS, etc. I'll try to set aside some time to look at this over the coming week. FWIW, this set of kernels used to work (obviously); it might just require looking over the AMD commits to see what worked in the past and trying to translate it into the new format.

Thanks for continuing to take a look at this, the effort has been very helpful and much appreciated.
