
HIP compilation #135

Open · wants to merge 26 commits into master

Conversation

@ryanstocks00

Similar to wavefunction91/ExchCXX#39, I had to make these minor changes to get GauXC to compile with HIP:

```cpp
std::min(uint64_t(16), util::div_ceil( nbf_max, 16 )),
std::min(uint64_t(GGA_KERNEL_SM_BLOCK_Y), util::div_ceil( npts_max, GGA_KERNEL_SM_BLOCK_Y )),
```
@ryanstocks00 (Author) commented Jul 18, 2024

I think the nbf_max usage here was potentially a bug? I have replaced it with npts_max.
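
For context, a minimal sketch of the computation under discussion; util::div_ceil and GGA_KERNEL_SM_BLOCK_Y are taken from the snippet above, while the helper definition and the constant's value are assumptions:

```cpp
// Minimal sketch of the launch-dimension change under discussion; the
// div_ceil definition and the GGA_KERNEL_SM_BLOCK_Y value are assumptions.
#include <algorithm>
#include <cstdint>

namespace util {
  inline uint64_t div_ceil(uint64_t a, uint64_t b) { return (a + b - 1) / b; }
}

constexpr uint64_t GGA_KERNEL_SM_BLOCK_Y = 16; // assumed value

// The kernel's Y axis iterates over grid points, so the clamp should use
// npts_max (this PR) rather than nbf_max (the original code).
inline uint64_t gga_block_y(uint64_t npts_max) {
  return std::min(GGA_KERNEL_SM_BLOCK_Y,
                  util::div_ceil(npts_max, GGA_KERNEL_SM_BLOCK_Y));
}
```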

@wavefunction91 (Owner)

It's not a bug, but potentially not optimal on some hardware. What have you been testing on? In principle, these parameters should be tuned, these were just the ones that were found to perform best on V100/A100.

@ryanstocks00 (Author)

Is it guaranteed that npts_max is greater than nbf_max? My understanding from a perusal of the function is that the y axis iterates over the points rather than the basis functions, so the nbf_max could have been problematic. I'm still working on testing it all out, so I certainly haven't got as far as performance tuning yet.

@wavefunction91 (Owner)

It's not guaranteed, although it usually is. I agree, what's there is likely a typo, but the kernel is hardened to take any block/grid dimension and still give the right results (i.e. whether or not loops get executed is based on the number of warps in the thread block).

I'll check to see whether this kneecaps the performance of this kernel in prod. If there's no change or it's better, I'll accept it for being "correct"; if it's worse, I'll come back with a hand-wavy/tin-foil-hat reason for why that's the case :).
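
To illustrate what "hardened" means here, a sketch (not the actual GauXC kernel) of a loop whose correctness doesn't depend on the launch shape:

```cpp
// Illustrative sketch, not the actual GauXC kernel: trip counts are derived
// from the runtime block shape, so any block/grid dimensions partition the
// work differently but still touch every element exactly once.
__global__ void hardened_kernel(const double* in, double* out, int npts) {
  const int nthreads = blockDim.x * blockDim.y;          // threads per block
  const int tid      = threadIdx.y * blockDim.x + threadIdx.x;
  for (int i = blockIdx.x * nthreads + tid; i < npts;
       i += gridDim.x * nthreads) {                      // grid-stride loop
    out[i] = 2.0 * in[i];
  }
}
```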

@wavefunction91 (Owner) left a comment

This is a bit more complicated than wavefunction91/ExchCXX#39. I'm happy to work with you on getting this fixed up, but the HIP implementation in GauXC is a bit delicate (or at least it was for older HIP/ROCm on MI250/300 a few years ago when this was a priority): the kernels are prone to silent errors due to suboptimal resource usage emitted by the optimizing compilers (i.e. the kernel is too large for the launch parameters, and we never really hardened the implementation to do the "right" thing). Getting e.g. the uvvars kernels to work was quite a bit of effort, IIRC.

I can provide you with some of our canonical stress test cases to validate the implementation - again, I don't have access to AMD hardware at the moment, so I can't really test anything on my end.


@ajaypanyala (Contributor) commented Jul 19, 2024

@wavefunction91 @ryanstocks00 After a couple of minor tweaks to the build to enable successful hipblas discovery, I was able to build the code on Frontier. I made sure this PR points to @ryanstocks00's fork of ExchCXX. However, when I run a test, I get:

```
  what():  Generic GauXC Exception (EXX + non-CUDA NYI)
  File     /scratch/panyala/bdft/GauXC_External-prefix/src/GauXC_External/src/xc_integrator/local_work_driver/device/scheme1_base.cxx
  Function virtual void GauXC::AoSScheme1Base::eval_exx_ek_screening_bfn_stats(XCDeviceData *)
  Line     1389
```

@wavefunction91 (Owner)

Thanks Ajay, yeah, we don't have a HIP implementation of snK (although it might be easy to add; we would just need to tune the kernel params, since it's linear along the wavefront), so we just need to add the logic to not run that test.

Can you provide the diff for the build system stuff?
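
(The guard might look something like the following sketch; the macro name GAUXC_HAS_CUDA and the test body are assumptions, not GauXC's actual API.)

```cpp
// Hypothetical sketch of "the logic to not run that test"; GAUXC_HAS_CUDA
// and the test body are assumed names, not GauXC's actual configuration.
#include <iostream>

void maybe_run_snK_test() {
#ifdef GAUXC_HAS_CUDA
  // Exercise the EXX/snK screening path (currently CUDA-only).
  std::cout << "running snK test\n";
#else
  // Skip on HIP and other backends rather than tripping the
  // "EXX + non-CUDA NYI" exception shown above.
  std::cout << "skipping snK test (non-CUDA backend)\n";
#endif
}
```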

@ajaypanyala (Contributor)

No worries, I realized that just now regarding sn-K. The regular XC eval works fine. Can I commit the build system changes to this PR?

@wavefunction91 (Owner)

That would be up to @ryanstocks00.

@ryanstocks00 (Author)

> This is a bit more complicated than wavefunction91/ExchCXX#39. I'm happy to work with you on getting this fixed up, but the HIP implementation in GauXC is a bit delicate (or at least it was for older HIP/ROCm on MI250/300 a few years ago when this was a priority): the kernels are prone to silent errors due to suboptimal resource usage emitted by the optimizing compilers. Getting e.g. the uvvars kernels to work was quite a bit of effort, IIRC.
>
> I can provide you with some of our canonical stress test cases to validate the implementation - again, I don't have access to AMD hardware at the moment, so I can't really test anything on my end.

Yes, we have had similar issues on the MI250 hardware. My main priority here was getting something that would compile so as not to break our HIP CI pipeline. I haven't yet gone through thorough testing, as I was going to get it integrated on NVIDIA systems first. That said, we do have access to AMD hardware, so I am happy to do whatever testing you think will be useful. I'm hopeful there will be more useful contributions going forward as I wrap my head around it all.

@ajaypanyala happy for you to commit changes to this PR - do you have the required permissions to push to the branch in my repo?

@ajaypanyala (Contributor) commented Jul 20, 2024

> @ajaypanyala happy for you to commit changes to this PR - do you have the required permissions to push to the branch in my repo?

@ryanstocks00 I do not have the permissions. Could you please add me ?

@wavefunction91 (Owner)

@ajaypanyala How are you testing this? Just the UTs or are you running something non-trivial (e.g. Ubi)? I'd like to get a full run of Ubi/DZ on AMD HW before merging this.

Also, we might want to wait until #91 is merged, as it will require additional updates to work with HIP; there are potentially some issues to work out with runtime errors for large systems as well. We might need to coordinate with @mikovtun to get a big test system (they have some big things they're testing with).

@ajaypanyala (Contributor)

@wavefunction91 Tested with Ubi/DZ (pbe0) on MI250X.

@ajaypanyala (Contributor)

@wavefunction91 Is this ready to go (modulo the merge conflict)?

@ryanstocks00 (Author)

@wavefunction91 @ajaypanyala I have fixed the merge conflicts so that it successfully compiles with HIP; however, it fails a lot of tests. It would be great to get some more experienced eyes over it.

@wavefunction91 (Owner)

Hi @ryanstocks00, sorry for the delay. I've recently changed jobs, so I'm still in the process of renormalizing where my time is spent.

Great that this compiles. Could you provide more info on which tests are failing? That will help us pinpoint where things could be going wrong. It might also be worth running the standalone_driver through rocgdb to see if we can see which kernels are dying. If I recall correctly, the last time I worked with AMD/HIP the issue was invariably silent kernel launch failures due to hard-coded launch params (optimized for A100).

@ryanstocks00 (Author)

@wavefunction91 hope Microsoft is treating you well! (Is this project likely to continue much development?)

The HIP code seems to be correct for LDA functionals, so I think I must have stuffed something in the GGA code during the merge (I'm not sure the MGGA code was ever fully implemented in HIP). I attempted to run rocgdb on standalone_driver as suggested using an MI250, however I didn't get any errors. Is this the right process? (hip_mi250_standalone_b3lyp_rocgdb.txt) The output is non-deterministic for both EXC and VXC.

I have attached the output from running ./gauxc_test on both NVIDIA and AMD (interestingly, there is some small numeric noise on the 4080 that causes some of the tests to fail, but I assume this is just because it is not a datacenter GPU):

cuda_4080s_standalone_b3lyp.txt
cuda_4080s_test.txt
hip_mi250_standalone.txt
hip_mi250_standalone_b3lyp.txt
hip_mi250_test.txt

@wavefunction91 (Owner)

@ryanstocks00 Thanks, this is helpful for diagnosing the problem (I think). Lots to unpack here; I'll try to cover everything.

> Is this project likely to continue much development?

Yes. Updates may be rolled out more slowly, but dev and support will continue.

> The HIP code seems to be correct for LDA functionals so I think I must have stuffed something in the GGA code during the merge

Nothing obvious from what I can see in the updates, but I agree with this assessment based on the results you've shared.

> I attempted to run rocgdb on standalone_driver as suggested using an MI250 however didn't get any errors

This should indicate that kernels are not failing to launch (although I'm not precluding it; rocgdb is not the most stable software on earth).

> is this the right process?

Looks like it.

> The output is non-deterministic for both EXC and VXC.

Can you expand on this? Both for LDA and GGA?

> 4080

Yes, this is known (#134). I'm not completely sold on the use of consumer GPUs for high-performance DFT simulations (happy to be proven wrong!), so I don't think I'll be prioritizing that for some time (that, and I don't have a consumer-grade GPU to test on!).

Based on the results you've shared, I think the next thing to check is whether or not the LDA gradients work on AMD. If they do, we can at least preclude the collocation gradients being the problem (as well as get a more extensive testing of the batched BLAS, since they'll run through similar paths). If those work, then it's in one of two kernels: the uvvars or the zmat. Since EXC looks busted for GGAs, I'd suspect (at least) the former is the culprit. When I was designing that kernel, it was definitely tuned for the NVIDIA SM structure (e.g. warp len of 32 + square process grids within the block). I'm not sure about the behaviour of e.g. the warp-level reductions on AMD and whether or not there's a non-obvious race condition. This should also be a problem for LDA (i.e. if the logic is fixed there, it should be fixed everywhere), but it might be worth another look.
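
For reference, the warp-length concern in a sketch (illustrative, not GauXC's actual reduction): a shuffle reduction hard-coded to a width of 32 silently ignores half of each 64-lane AMD wavefront, whereas deriving the width per platform keeps the reduction complete.

```cpp
// Sketch of a wavefront-size-aware reduction (not GauXC code). NVIDIA warps
// are 32 lanes; AMD CDNA wavefronts (e.g. MI250) are 64. A reduction
// hard-coded to 32 would silently drop half of each AMD wavefront.
__device__ double wave_reduce_sum(double x) {
#ifdef __HIP_PLATFORM_AMD__
  constexpr int width = 64;                       // AMD wavefront length
  for (int off = width / 2; off > 0; off /= 2)
    x += __shfl_down(x, off, width);              // HIP shuffle intrinsic
#else
  constexpr int width = 32;                       // NVIDIA warp length
  for (int off = width / 2; off > 0; off /= 2)
    x += __shfl_down_sync(0xffffffffu, x, off);   // CUDA sync shuffle
#endif
  return x;
}
```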

Again, I don't have access to AMD hardware at the moment, so there's not a ton I can do on the debugging side. Happy to brainstorm debugging ideas though, let me know.

P.S. I'm sure you're aware of this, but just to be explicit - running the standalone driver with b3lyp for benzene will fail the checks as only the SVWN5 data is saved there. I'm assuming you ran those to get the reference numbers for B3LYP?

@ryanstocks00 (Author)

@wavefunction91 thank you very much for the detailed response. I don't think we can check the grads on AMD as I get a "Generic GauXC Exception (LDA Grad NYI for HIP Backends)".

> The output is non-deterministic for both EXC and VXC. Can you expand on this? Both for LDA and GGA?

Using the standalone driver, EXC and VXC are both correct for LDA and both non-deterministic for GGA (and an order of magnitude different to the correct value calculated with the host CPU implementation).

I have finally managed to get it to compile and run with HIP on an NVIDIA machine, and I get much the same results as on AMD (correct LDA, non-deterministic GGA), which I think rules out AMD launch configuration issues and probably warp length issues, so I'm pretty sure it is a code issue. This could potentially be a way for you to test as well, since the issue is reproducible on NVIDIA hardware. If there's a good way to narrow down which kernel is the issue, that would be very helpful (e.g. does the fact that the "Li / SVWN5 / sto-3g" tests in gauxc_test fail despite being LDA mean anything?).

@ryanstocks00 (Author)

I think I have now got GauXC running correctly with HIP on an NVIDIA machine: the problem was the vvar grad kernel, which required the thread block to be square. I think this will still be problematic on AMD machines, as there isn't sufficient shared memory with the increased warp length, so it will need some more modification. Though, I'm going to have to put this on the backburner for a bit due to other priorities.
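
To make the shared-memory concern concrete (a sketch under assumed tile sizes, not the actual vvar grad kernel): a square tile whose edge tracks the warp/wavefront length quadruples its shared-memory footprint when moving from 32 to 64 lanes.

```cpp
// Illustrative sketch, not the actual vvar grad kernel. A square tile sized
// to the wave length needs 32*32*8 B = 8 KiB of shared memory on NVIDIA but
// 64*64*8 B = 32 KiB on AMD, which quickly exhausts LDS once a kernel holds
// several such buffers.
#ifdef __HIP_PLATFORM_AMD__
constexpr int TILE = 64;   // CDNA wavefront length
#else
constexpr int TILE = 32;   // NVIDIA warp length
#endif

__global__ void square_tile_transpose(const double* in, double* out) {
  // Assumes a square launch: blockDim.x == blockDim.y == TILE,
  // with one block covering a TILE x TILE patch of the input.
  __shared__ double tile[TILE][TILE];
  tile[threadIdx.y][threadIdx.x] = in[threadIdx.y * TILE + threadIdx.x];
  __syncthreads();
  out[threadIdx.y * TILE + threadIdx.x] = tile[threadIdx.x][threadIdx.y];
}
```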

@wavefunction91 (Owner)

> vvar_grad

For the GGA energy or for the gradient? You should only call the vvar kernel for non-gradients.

Admittedly, I hadn't fully appreciated how complicated that code became with the latest refactor to support GKS, etc. I'll try to set aside some time to look at this over the coming week. FWIW, this set of kernels used to work (obviously); it might just require looking over the AMD commits to see what worked in the past and trying to translate it into the new format.

Thanks for continuing to take a look at this, the effort has been very helpful and much appreciated.
