
Add GPU support to MRIReco.jl #182

Merged: 66 commits, Aug 13, 2024

Conversation

@nHackel (Member) commented Apr 17, 2024

This PR adds GPU support to MRIReco.jl and the relevant subpackages. It is based on the following PRs: LinearOperators.jl#321, RegularizedLeastSquares.jl#81, and LinearOperatorCollection.jl#9.

As with the other PRs, I will try to keep these changes GPU-agnostic and document my high-level idea and the relevant changes in this PR description.

To work on the GPU, we need all LinearOperators to have a GPU storage type. I've started adding the relevant S kwarg to the operators in the collection package and in MRIOperators.jl. Additionally, we need our solver input and dense arrays to live on the GPU where possible. To allow for this, I have added a new :arrayType (name might change) parameter to the parameter Dict, from which one can then construct the correct arrays and vectors, as sketched below.
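
A rough sketch of how this parameter could be used (the exact keyword and parameter names are illustrative and may differ from the final implementation):

using CUDA

# Requested backing array type (could also be a compatible callable)
params = Dict{Symbol, Any}(:arrayType => CuArray)

# Derive a fully parameterized storage type for the operators (the S kwarg)
S = typeof(params[:arrayType](zeros(ComplexF32, 0)))   # e.g. CuArray{ComplexF32, 1, ...}

# Dense solver inputs are moved to the requested backend with the same callable
kdata = rand(ComplexF32, 64)
kdata_gpu = params[:arrayType](kdata)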

I'll first try to keep all changes close to the Reco and operator packages. That in turn means that we need to convert the acquisition data every time during reconstruction. An alternative could be to specify the backing array type during data loading and derive it from there.

@nHackel (Member, Author) commented Apr 17, 2024

[image: first GPU reconstruction, showing phantom, CPU result, and GPU result]

And here is the first GPU reconstruction done with the changes in commit 8ecdc66. The left column shows the phantom, the middle column the CPU result and the right column the GPU result.

@aTrotier (Contributor)

Do you know a way to test the GPU reconstruction during CI?

I think @cncastillo is also trying to test KomaMRI with a different CI setup for the GPU part

@nHackel (Member, Author) commented Apr 18, 2024

There are JLArrays, which are a reference implementation of GPU arrays that run on the CPU. If the code remains GPU agnostic, we can have one test-suite running on "normal" arrays and then again on JLArrays. This is the approach we aim for in RegularizedLeastSquares at the moment.
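
A minimal sketch of that two-pass test setup (assuming JLArrays.jl as a test dependency; the actual test bodies would call the reconstruction):

using Test, JLArrays

# Run the same GPU-agnostic tests once with plain Arrays and once with
# JLArrays, a CPU-based reference implementation of the GPU array interface.
for arrayType in (Array, JLArray)
    @testset "Reconstruction on $arrayType" begin
        x = arrayType(rand(ComplexF32, 16))
        @test length(x) == 16   # placeholder for the real reconstruction tests
    end
end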

One issue with that approach is that we need to make all the operators work with JLArrays. In particular, we just call plan_fft and plan_nfft with a given array type, and I don't think either works for JLArrays. But I can implement a workaround for that in LinearOperatorCollection. I had to do a similar workaround for the WaveletOp, since Wavelets.jl doesn't work on the GPU.

I also saw this in the Julia #gpu Slack: https://github.com/JuliaGPU/buildkite. As far as I understood it, registered Julia packages that belong to an organization can apply to run on their CI.

@cncastillo (Contributor) commented Apr 18, 2024

I also saw this in the Julia #gpu Slack: JuliaGPU/buildkite. As far as I understood it, registered Julia packages that belong to an organization can apply to run on their CI.

Hi! Indeed, to use JuliaGPU BuildKite CI, your package should be a part of a Julia organization. We recently moved KomaMRI to JuliaHealth for this, and besides some broken links and a few minor issues (JuliaHealth/KomaMRI.jl#335), it was straightforward.

I think @cncastillo is also trying to test KomaMRI with a different CI setup for the GPU part

We are planning on testing Koma on multiple GPUs anyway, so we will do some tests like the ones you require (here: https://github.com/JuliaHealth/KomaMRI.jl/blob/master/test/runtests.jl). As we use MRIReco, we are happy to do the tests for you, but we haven't set up the Buildkite CI config yet 😄. In the meantime, you can create a PR in Koma pointing to the correct version of MRIReco (this PR), add CUDA, AMD, Metal, etc. to the test environment, and modify Buildkite's config (pipeline.yml) to run the tests you want. I can make you a collaborator if you need it.

Having said that, Koma's current integration with MRIReco uses RawAcquisitionData as an intermediate. The MRD format inspired this custom type, so the data is forced to a specific type (which makes sense for writing to MRD). It would be nice if GPUArrays could flow freely from the simulations to the recon; I think the main culprits are the traj and data properties of the Profile type; maybe those could be relaxed.

mutable struct Profile
  head::AcquisitionHeader
  traj::Array{Float32,2}            #<- allow AbstractArray?, but then force to F32 when writing
  data::Array{Complex{Float32},2}   #<- allow AbstractArray?, but then force to CF32 when writing
end

Something similar for AcquisitionData.
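
For example, a relaxed Profile could look roughly like this (a hedged sketch, not the actual MRIBase definition; AcquisitionHeader is assumed to come from MRIBase, and the conversion to dense Float32/ComplexF32 would then happen only when writing MRD):

using MRIBase: AcquisitionHeader

mutable struct RelaxedProfile{T<:AbstractArray{Float32,2}, C<:AbstractArray{Complex{Float32},2}}
  head::AcquisitionHeader
  traj::T   # any 2D Float32 array, e.g. a CuArray
  data::C   # any 2D ComplexF32 array
end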

Cheers!

@nHackel (Member, Author) commented Apr 19, 2024

Getting access to BuildKite via KomaMRI.jl sounds like a great idea! I'll look into that once everything is set up here. I'll still probably set up the JLArray test case for the local tests.

I saw that LinearOperators.jl had a CI configuration where a maintainer could trigger breakage tests with upstream packages. Something like that would be a good fit for your idea @cncastillo. Then we can test it for the commits where it really matters.

I've also thought about extending my changes to the base, data and file handling packages, but for this PR I'll stick close to a refactoring of the Reco and operator packages with small changes in the other ones if necessary. For now I just want to get all the MRIReco tests and the MRIReco benchmark running on a GPU.

Once we have that running, we will have a good idea of what structures and information we need to run things on the GPU, and then we can look into further PRs that add GPU support or more generic array compatibility in the simulation and "base" packages.

@nHackel (Member, Author) commented Apr 22, 2024

Closes #168

@nHackel (Member, Author) commented Jul 31, 2024

This PR is now ready to be merged. It adds GPU support to MRIReco.jl and MRIOperators.jl. It also required smaller fixes/dependency updates to MRIBase.jl and MRISimulations.jl.

To run a reconstruction on the GPU, a user needs to supply the arrayType parameter. This does not have to be a concrete GPU array type such as CuArray or ROCArray; it can also be any callable foo that implements the following calls:

arr = zeros(ComplexF32, 32, 32)

# 1. Convert existing array
arr_gpu = foo(arr) 

# 2. Create vector with given eltype and undef init.
v = foo{ComplexF32}(undef, 0)

"Normal" arrays also fulfil this interface. We need the first call to move data to the GPU and the second one to make sure that our operators have the correct storage type.

When we execute a reconstruction on the GPU, we need to slightly change the reconstruction and its parameters. While, at least for CUDA, each task seems to get its own CUDA stream/context, we generally want to disable the multi-threaded reco loops.

This is possible for the @floop loops by setting an executor, which we derive from the given arrayType:

executor(foo)

For AbstractGPUArrays, this function returns a sequential executor; for "custom" callable objects it would need to be implemented. Since we execute the loop sequentially, we don't need to copy our operators but can instead reuse them. This is again handled by a new function called copyOpsFn, which returns the identity for GPU arrays.
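
A hedged sketch of what such dispatch could look like (the actual definitions in MRIReco.jl/MRIOperators.jl may differ):

using FLoops, GPUArrays

# Sequential execution on the GPU, multi-threaded loops on the CPU
executor(::Type{<:AbstractGPUArray}) = SequentialEx()
executor(::Type{<:Array})            = ThreadedEx()

# Operators are reused sequentially on the GPU, copied per thread on the CPU
copyOpsFn(::Type{<:AbstractGPUArray}) = identity
copyOpsFn(::Type{<:Array})            = copy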

One last change related to GPU arrays concerns the parameters for the FFT. Generally, GPU plan_fft calls don't accept the same parameters (or any at all) as the CPU version. This is again handled by a function MRIOperators.fftParams.
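
Roughly, assuming the FFT parameters are passed as keyword arguments (the real fftParams may look different):

using FFTW, GPUArrays

fftParams(::Type{<:Array})            = (; flags = FFTW.MEASURE)   # CPU: FFTW planner flags
fftParams(::Type{<:AbstractGPUArray}) = (;)                        # GPU: no extra parameters

# The derived parameters are then splatted into the plan call
plan = plan_fft(rand(ComplexF32, 16); fftParams(Array)...)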

I've also added GPU support to NFFT, which is implemented as a weak package extension. That should mean that AMD cards can also use the NFFT operator. I currently don't have access to one to test this, and I can't use Buildkite to fully test it until this PR is merged. Once it is merged, we can enable the Buildkite steps I prepared.

As @cncastillo mentioned, the next steps could be to add GPU support to the acquisition data and, more generally, the other packages. One could then derive the arrayType from the data itself.

@nHackel marked this pull request as ready for review on July 31, 2024 09:03
@cncastillo (Contributor) commented Jul 31, 2024

This is great news :).

The current way of choosing the backend is a little weird; is this a limitation of LinearOperators.jl?

In case it is useful, @rkierulf and I worked on some functions, gpu/cpu, that automatically call the correct backend using package extensions and let you choose the device on multi-GPU systems. So this works:

using KomaMRICore
using CUDA                   # Loads KomaMRICoreCUDAExt and sets the default backend to CUDA
KomaMRICore.set_device!(1)   # Not needed if you want to run it on the first (maybe only) CUDA device
x = zeros(10) |> f32 |> gpu  # CuArray{Float32}, `f64` is also available but not all the backends support it
y = myfun(x)  |> cpu         # Array{Float32} 

All the logic for these is very lightweight, and they do not have anything necessarily "Koma"-related, so they could be moved to MRIBase or a new package. The only thing required to convert a new custom type would be to tag it with @functor MyCustomType (this can handle nested structs, as it was based on Flux's adapters). This is everything, I think.
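
A minimal illustration of that @functor tagging (a sketch using Functors.jl, which provides @functor; this is not the Koma code itself):

using Functors

struct MyCustomType
    data::Vector{Float32}
end
@functor MyCustomType

# fmap applies a function to every array leaf, which is how gpu/cpu adapters move nested structs
doubled = fmap(x -> 2 .* x, MyCustomType(rand(Float32, 4)))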

Also, instead of using copyOpsFn, you may want to @view the array/operator, which already does what you described for GPU arrays, and on the CPU it does not copy the array but points to it, resulting in fewer memory allocations and potentially faster code. But I may be missing something, as I am not a LinearOperators.jl expert.

@nHackel (Member, Author) commented Jul 31, 2024

Hmm, choosing the device and backend like that would be cool. And when we do that for the acq. data structs, we could properly use adapt. On the reco level we kinda only have the dictionary.

Internally, we would still need to derive this callable function/arraytype in some fashion for the operators to work.
We need a concrete, fully parameterized vector datatype for multiplication with our matrix-free operators. This is essentially the S keyword. It is what LinearOperators.jl constructs for us as a result when we do op * x, see here.

And we need the first variant to move our arrays to the GPU. However, I just remembered I haven't done a pass over my code with Adapt.jl, so I'll do that before it is merge-ready.

We can't use a concrete datatype with adapt; the function wants adapt(CuArray, ...) instead of adapt(CuArray{Float32, 1, CUDA.DeviceMemory}, ...). But we can strip a given array type (so the typeof(acqData.foo)) down to this base array type. See also here.
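
One way to do that stripping (a sketch using an internal-but-common Julia idiom; not necessarily what the PR ends up using):

using Adapt, CUDA

arr = CuArray{Float32}(undef, 4)                 # e.g. a dense field of the acquisition data
baseType = Base.typename(typeof(arr)).wrapper    # CuArray, without its type parameters

x = adapt(baseType, rand(Float32, 4))            # adapt(CuArray, ...) works as intended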

@nHackel (Member, Author) commented Jul 31, 2024

Ah, we can't use a view, because for CPU operations we actually want copies of the operators: they are stateful and not thread-safe.

@cncastillo (Contributor) commented Jul 31, 2024

Internally, we would still need to derive this callable function/arraytype in some fashion for the operators to work.

At least with KernelAbstractions.jl, there are functions like get_backend(array) that do this. I think GPUArrays.jl will be moved to use KA.jl (JuliaGPU/GPUArrays.jl#525), but until then, there seems to be an equivalent backend(array) in GPUArrays.jl.
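
For example (assuming CUDA.jl for the GPU case):

using KernelAbstractions, CUDA

KernelAbstractions.get_backend(zeros(Float32, 8))       # CPU()
KernelAbstractions.get_backend(CUDA.zeros(Float32, 8))  # CUDABackend()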

@nHackel (Member, Author) commented Jul 31, 2024

We can derive that from a dense GPU array of the acquisition data using Base, so that's covered once we have that.

MRIReco could also add KernelAbstractions as a dependency, or adopt your backend-selection solution.

In MPIReco.jl I moved to KernelAbstractions.jl because I needed shared memory, which I couldn't get to work with GPUArrays.jl. We could also port the kernels here; I just didn't have the time and it wasn't necessary so far.

@tknopp merged commit a37d0ac into master on Aug 13, 2024
3 of 5 checks passed