
Improve gpuistl using cudaGraphs #5852

Open · wants to merge 3 commits into master
Conversation

@multitalentloes (Contributor) commented Jan 8, 2025

Using CUDA graphs reduces the overhead associated with many consecutive kernel launches in the GPU ILU and DILU preconditioners.
The code changes touch very few lines in the apply itself, though the stream has to be specified explicitly and many functions need an updated signature.

Speedups on spe1, spe11 and sleipner are typically around 1.1 to 1.2 in the preconditioners' apply.

There is no speedup yet in the update, which seems strange given that it has the same kernel pattern. A minimal, hedged sketch of the capture-and-replay pattern follows below.
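For readers unfamiliar with the technique, the sketch below shows how a sequence of short kernel launches can be captured into a cudaGraph once and replayed with a single launch afterwards. It is not the actual gpuistl code; smallKernel, numLevels and applyWithGraph are made-up placeholders, and error checking is omitted.

#include <cuda_runtime.h>

// Hypothetical stand-in for one of the many short kernels launched per apply();
// the real gpuistl level-set kernels are not shown here.
__global__ void smallKernel(double* x, int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        x[i] *= 0.5;
    }
}

// Sketch: capture the launch sequence once, then replay it with a single
// cudaGraphLaunch on every subsequent call. Assumes the launch parameters
// and pointers do not change between calls.
void applyWithGraph(double* x, int n, int numLevels, cudaStream_t stream)
{
    static cudaGraphExec_t graphExec = nullptr;

    if (graphExec == nullptr) {
        cudaGraph_t graph;
        // Capture only works on an explicitly created (non-default) stream,
        // which is why the interfaces now take a cudaStream_t.
        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        for (int level = 0; level < numLevels; ++level) {
            smallKernel<<<(n + 255) / 256, 256, 0, stream>>>(x, n);
        }
        cudaStreamEndCapture(stream, &graph);

        // CUDA 12-style signature; older toolkits use a 5-argument overload
        // or cudaGraphInstantiateWithFlags instead.
        cudaGraphInstantiate(&graphExec, graph, 0);
        cudaGraphDestroy(graph);
    }

    // One launch replaces numLevels individual launches, which is where the
    // reduction in launch overhead comes from.
    cudaGraphLaunch(graphExec, stream);
}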

template void computeDiluDiagonal<T, blocksize>( \
T*, int*, int*, int*, int*, const int, int, T*, int, cudaStream_t); \
template void computeDiluDiagonalSplit<blocksize, T, double, MatrixStorageMPScheme::DOUBLE_DIAG_DOUBLE_OFFDIAG>( \
const T*, \
@multitalentloes (Contributor Author) commented:

clang-format didn't make this bit prettier...

@multitalentloes (Contributor Author) commented:

Speedup is measured to be 10% to 20% on the apply of both ILU0 and DILU. No speedup is measured for the update, so I am not introducing cudaGraphs there; I have no idea why we do not see the same runtime reduction when the update also has lots of short kernels.

Still not mergeable, as the results are not yet verified on a consumer-grade AMD card to ensure there is no slowdown there (a speedup is not expected with current ROCm versions and the current hardware generation).

@multitalentloes multitalentloes marked this pull request as ready for review January 20, 2025 08:01
@multitalentloes (Contributor Author) commented:

Now tested on AMD: a 1.02 speedup on the ILU0 apply, though I am not sure this is significant or consistently reproducible.
It seems safe to use the graphs.

is also supported in HIP, though it does not seem to affect performance in any clear way. 1.1 to 1.2 speedup on Nvidia GPUs.
@@ -174,6 +174,19 @@ GpuBuffer<T>::copyFromHost(const T* dataPointer, size_t numberOfElements)
OPM_GPU_SAFE_CALL(cudaMemcpy(data(), dataPointer, numberOfElements * sizeof(T), cudaMemcpyHostToDevice));
}

template <class T>
void
GpuBuffer<T>::copyFromHost(const T* dataPointer, size_t numberOfElements, cudaStream_t stream)
@multitalentloes (Contributor Author) commented:

Various functions have been changed to take the CUDA stream. This is because recording a set of GPU activities into a cudaGraph requires using an explicitly created (non-default) stream.
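As a hedged illustration only (the body of the new overload is not shown in the excerpt above, and the actual implementation in this PR may differ), such an overload could issue the copy asynchronously on the caller's stream so that it can be recorded into a graph together with the surrounding kernels:

template <class T>
void
GpuBuffer<T>::copyFromHost(const T* dataPointer, size_t numberOfElements, cudaStream_t stream)
{
    // Assumption: same bounds handling as the synchronous overload, but the
    // copy is enqueued on the supplied stream instead of the default stream,
    // so it can take part in stream capture.
    OPM_GPU_SAFE_CALL(cudaMemcpyAsync(
        data(), dataPointer, numberOfElements * sizeof(T), cudaMemcpyHostToDevice, stream));
}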
