FoldsCUDA.jl provides a Transducers.jl-compatible fold (reduce) implemented using CUDA.jl. This brings the transducers and reducing function combinators implemented in Transducers.jl to the GPU. Furthermore, using FLoops.jl, you can write parallel `for` loops that run on the GPU.
FoldsCUDA exports `CUDAEx`, a parallel loop executor. It can be used with the parallel `for` loop created with `FLoops.@floop`, the `Base`-like high-level parallel API in Folds.jl, and extensible transducers provided by Transducers.jl.
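As a minimal sketch of the `Base`-like API (assuming a CUDA-capable GPU is available; the input array and values here are illustrative), Folds.jl functions take the executor as their last positional argument:

```julia
using FoldsCUDA, CUDA, Folds

# A small illustrative input; any CuArray works.
xs = CUDA.ones(Float32, 10)

# Pass the CUDAEx executor to run the summation on the GPU.
Folds.sum(xs, CUDAEx())
```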
You can pass the CUDA executor `FoldsCUDA.CUDAEx()` to `@floop` to run a parallel `for` loop on the GPU:
```julia
julia> using FoldsCUDA, CUDA, FLoops

julia> using GPUArrays: @allowscalar

julia> xs = CUDA.rand(10^8);

julia> @allowscalar xs[100] = 2;

julia> @allowscalar xs[200] = 2;

julia> @floop CUDAEx() for (x, i) in zip(xs, eachindex(xs))
           @reduce() do (imax = -1; i), (xmax = -Inf32; x)
               if xmax < x
                   xmax = x
                   imax = i
               end
           end
       end

julia> xmax
2.0f0

julia> imax  # the *first* position for the largest value
100
```
```julia
julia> using Transducers, Folds

julia> @allowscalar xs[300] = -0.5;

julia> Folds.reduce(TeeRF(min, max), xs, CUDAEx())
(-0.5f0, 2.0f0)

julia> Folds.reduce(TeeRF(min, max), (2x for x in xs), CUDAEx())  # iterator comprehension works
(-1.0f0, 4.0f0)

julia> Folds.reduce(TeeRF(min, max), Map(x -> 2x)(xs), CUDAEx())  # equivalent, using a transducer
(-1.0f0, 4.0f0)
```
For more examples, see the examples section in the documentation.