HIP compilation #135
base: master
Conversation
std::min(uint64_t(16), util::div_ceil( nbf_max, 16 )),
std::min(uint64_t(GGA_KERNEL_SM_BLOCK_Y), util::div_ceil( npts_max, GGA_KERNEL_SM_BLOCK_Y )),
I think the nbf_max usage here was potentially a bug? I have replaced it with npts_max.
It's not a bug, but potentially not optimal on some hardware. What have you been testing on? In principle, these parameters should be tuned, these were just the ones that were found to perform best on V100/A100.
Is it guaranteed that npts_max is greater than nbf_max? My understanding from a perusal of the function is that the y axis iterates over the points rather than the basis functions, so the nbf_max could have been problematic. I'm still working on testing it all out, so I certainly haven't got as far as performance tuning yet.
It's not guaranteed, although it usually is. I agree, what's there is likely a typo, but the kernel is hardened to take any block/grid dimension and still give the right results (i.e. whether or not loops get executed is based on the number of warps in the thread block).
I'll check to see whether this kneecaps the performance of this kernel in prod - if there's no change or it's better, I'll accept it for being "correct", if it's worse I'll come back with a hand-wavy/tin-foil-hat reason for why that's the case :).
This is a bit more complicated than wavefunction91/ExchCXX#39. I'm happy to work with you on getting this fixed up, but the HIP implementation in GauXC is a bit delicate (or at least it was for older HIP/ROCm on MI250/300 a few years ago when this was a priority) - the kernels are prone to silent errors due to suboptimal resource usage emitted by the optimizing compilers (i.e. the kernel is too large for the launch parameters, and we never really hardened the implementation to do the "right" thing). Getting e.g. the uvvars kernels to work was quite a bit of effort, IIRC.
I can provide you with some of our canonical stress test cases to validate the implementation - again, I don't have access to AMD hardware at the moment, so I can't really test anything on my end.
@wavefunction91 @ryanstocks00 After a couple of minor tweaks to the build to enable successful
Thanks Ajay, yea, we don't have a HIP implementation of snK (although it might be easy to add - we would just need to tune the kernel params; it's linear along the wavefront), so we just need to add the logic to not run that test. Can you provide the diff for the build system stuff?
No worries, I realized that just now regarding sn-K. The regular XC eval works fine. Can I commit the build system changes to this PR?
That would be up to @ryanstocks00.
Yes, we have had similar issues on the MI250 hardware. My main priority here was getting something that would compile so as not to break our HIP CI pipeline. I haven't yet gone through thorough testing, as I was going to get it integrated on NVIDIA systems first. That said, we do have access to AMD hardware, so I am happy to do whatever testing you think will be useful. I'm hopeful there will be more useful contributions going forward as I wrap my head around it all. @ajaypanyala happy for you to commit changes to this PR - do you have the required permissions to push to the branch in my repo?
@ryanstocks00 I do not have the permissions. Could you please add me?
@ajaypanyala How are you testing this? Just the UTs, or are you running something non-trivial (e.g. Ubi)? I'd like to get a full run of Ubi/DZ on AMD HW before merging this. Also, we might want to wait until #91 is merged, as it will require additional updates to get to work with HIP - potentially some runtime-error issues to work out for large systems as well. Might need to coordinate with @mikovtun to get a big test system (they have some big things they're testing with).
@wavefunction91 Tested with Ubi/DZ (pbe0) on MI250X.
@wavefunction91 Is this ready to go (modulo the merge conflict)?
@wavefunction91 @ajaypanyala I have fixed the merge conflicts so that it successfully compiles with HIP, however it fails a lot of tests; it would be great to get some more experienced eyes over it.
Hi @ryanstocks00, sorry for the delay. I've recently changed jobs, so I'm still in the process of renormalizing where my time is spent. Great that this compiles. Could you provide more info on which tests are failing? That will help us pinpoint where things could be going wrong. It might also be worth running the standalone_driver through rocgdb to see if we can see which kernels are dying. If I recall, the last time I worked with AMD/HIP, the issue was invariably silent kernel launch failures due to hard-coded launch params (optimized for A100).
@wavefunction91 hope Microsoft is treating you well! (Is this project likely to continue much development?) The HIP code seems to be correct for LDA functionals, so I think I must have stuffed something in the GGA code during the merge (not sure the MGGA code was ever fully implemented in HIP). I attempted to run rocgdb on standalone_driver as suggested, using an MI250, however I didn't get any errors - is this the right process? Output attached: hip_mi250_standalone_b3lyp_rocgdb.txt. The output is non-deterministic for both EXC and VXC. I have also attached the output from running on CUDA: cuda_4080s_standalone_b3lyp.txt
@ryanstocks00 Thanks, this is helpful for diagnosing the problem (I think). Lots to unpack here, I'll try to cover everything.

> (Is this project likely to continue much development?)

Yes, updates may be rolled out slower, but dev and support will continue.

> I think I must have stuffed something in the GGA code during the merge

Nothing obvious from what I can see from the updates, but I agree with this assessment based on the results you've shared.

> I attempted to run rocgdb on standalone_driver as suggested, using an MI250, however I didn't get any errors

This should indicate that kernels are not failing to launch (although I'm not precluding it,

> is this the right process?

Looks like it.

> The output is non-deterministic for both EXC and VXC.

Can you expand on this? Both for LDA and GGA?

> I have also attached the output from running on CUDA: cuda_4080s_standalone_b3lyp.txt

Yes, this is known #134. I'm not completely sold on the use of consumer GPUs for high-performance DFT simulations (happy to be proven wrong!), so I don't think I'll be prioritizing that for some time (that, and I don't have a consumer-grade GPU to test on!)

Based on the results you've shared, I think the next thing to check is whether or not the LDA gradients work on AMD. If they do, we can at least preclude the collocation gradients being the problem (as well as a more extensive testing of the batched BLAS - they'll run through similar paths). If those work, then it's in one of two kernels - the

Again, I don't have access to AMD hardware at the moment, so there's not a ton I can do on the debugging side. Happy to brainstorm debugging ideas though, let me know.

P.S. I'm sure you're aware of this, but just to be explicit - running the standalone driver with b3lyp for benzene will fail the checks, as only the SVWN5 data is saved there. I'm assuming you ran those to get the reference numbers for B3LYP?
@wavefunction91 thank you very much for the detailed response. I don't think we can check the grads on AMD, as I get a "Generic GauXC Exception (LDA Grad NYI for HIP Backends)".

> > The output is non-deterministic for both EXC and VXC.
>
> Can you expand on this? Both for LDA and GGA?

I have finally managed to get it to compile and run with HIP on an NVIDIA machine and get much the same results as on AMD (correct LDA, non-deterministic GGA), which I think rules out AMD launch configuration issues and probably warp length issues, so I'm pretty sure it is a code issue. This could potentially be a way for you to test as well, since the issue is replicable on NVIDIA hardware. If there's a good way to narrow down which kernel is the issue, that would be very helpful (e.g. does the fact that the "Li / SVWN5 / sto-3g" tests in gauxc_test fail despite being LDA mean anything?).
I think I have now got GauXC running correctly with HIP on an NVIDIA machine - the problem was the vvar grad kernel, which required the thread block to be square. I think this will still be problematic on AMD machines, as there isn't sufficient shared memory with the increased warp length, so it will need some more modification, though I'm going to have to put this on the backburner for a bit due to other priorities.
> the problem was the vvar grad kernel, which required the thread block to be square

For the GGA energy or for the gradient? You should only call the

Admittedly, I hadn't fully appreciated how complicated that code became with the latest refactor to support GKS, etc. I'll try to set aside some time to look at this over the coming week. FWIW - this set of kernels used to work (obviously); it might just require looking over the AMD commits to see what worked in the past and trying to see how to translate it into the new format.

Thanks for continuing to take a look at this, the effort has been very helpful and much appreciated.
Similar to wavefunction91/ExchCXX#39, I had to make these minor changes to get GauXC to compile with HIP.