Purpose
The purpose of the Proposal and how it accomplishes a team/CliMA objective.
Link to any relevant PRs/issues.
Add GPU support to ClimaCore

Cost/benefits/risks
An analysis of the cost/benefits/risks associated with the proposal.
The key benefit of this proposal is that it would allow us to run code on GPUs, hopefully with minimal changes required to user code.
Producers
A list of the resources and named personnel required to implement the Proposal.
Components
A description of the main components of the software solution.
Inputs
A description of the inputs to the solution (designs, references, equations, closures, discussions etc).
The core idea is to build a mechanism for describing GPU kernels at a high level, which can then be used to generate efficient GPU code.
Horizontal kernel
Horizontal kernels are those which include horizontal (spectral element) operations. Internally, we will aim to generate kernels using a similar model to ClimateMachine.jl, which has proven to be very performant.
Threading will use a block for each slab and a thread for each node inside the slab. As a result, simple broadcasting over a field, e.g.

```julia
y .= f.(x)
```

would consist of a simple kernel, written in KernelAbstractions.jl (KA from here on) notation.
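As a rough illustration, such a kernel could look like the following (a minimal sketch, not the actual implementation; the kernel name and the `[slabidx][i, j]` field indexing are assumptions):

```julia
using KernelAbstractions

# Illustrative sketch: pointwise broadcast with one workgroup (block) per slab
# and one thread per node within the slab.
@kernel function broadcast_kernel!(y, f, x)
    slabidx = @index(Group, Linear)  # which slab this block handles
    i, j = @index(Local, NTuple)     # which node this thread handles
    y[slabidx][i, j] = f(x[slabidx][i, j])
end
```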
Operators (gradient, divergence, curl, interpolation, restriction) will consist of the following operations:

1. allocate an input work array shared across the block (`@localmem` in KA, or shared memory in CUDA) and an output array that is local to the thread (`@private` in KA, `MArray` in CUDA)
2. copy the input to the work array
3. synchronization of threads
4. computation of the spectral derivatives from the scratch array
5. write to the output array
For example, a gradient operation

```julia
grad = Operators.Gradient()
y .= grad.(f.(x))
```

would correspond to a kernel such as:
```julia
slabidx = @index(Group, Linear)
i, j = @index(Local, NTuple)

# 1) allocate
work_in = @localmem FT (Nq, Nq)
work_out = @private FT (Nq, Nq, 2)

# 2) copy to input work array, concretizing any intermediate steps
work_in[i, j] = f(x[slabidx][i, j])

# 3) synchronize threads
@synchronize()

# 4) spectral derivatives
work_out[i, j, 1] = work_out[i, j, 2] = 0.0
for k = 1:Nq
    work_out[i, j, 1] += D[i, k] * work_in[k, j]
    work_out[i, j, 2] += D[j, k] * work_in[i, k]
end

# 5) write to output array
y[slabidx][i, j] = Covariant12Vector(work_out[i, j, 1], work_out[i, j, 2])
```
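(In this sketch, `FT` is the floating-point element type, `Nq` the number of quadrature nodes per direction, and `D` the `Nq × Nq` spectral differentiation matrix; `work_in` is shared across the block, while `work_out` is private to each thread.)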
Interface
We will continue to make use of the broadcasting syntax for high-level specification, and combine that with the `byslab` function. Consider the following expression:

```julia
byslab(space) do idx
    grad = Operators.Gradient()
    y[idx] .= grad.(f.(x[idx]))
    u[idx] .= g.(y[idx])
end
```
This is really just a different syntax for the following:
```julia
function k(idx)
    grad = Operators.Gradient()
    y[idx] .= grad.(f.(x[idx]))
    u[idx] .= g.(y[idx])
end
byslab(k, space)
```
The basic idea is to transform `k` into a kernel function, which will then be launched by `byslab`.
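A hedged sketch of what the launch side could look like, assuming a KernelAbstractions backend (`byslab_kernel!`, `nslabs`, and the `SlabIndex(slabidx)` constructor are hypothetical names, and the workgroup shape is illustrative):

```julia
using KernelAbstractions

# Illustrative: one workgroup per slab, one thread per node; each thread calls
# the user function with its own slab-node index.
@kernel function byslab_kernel!(f)
    slabidx = @index(Group, Linear)
    i, j = @index(Local, NTuple)
    f(SlabNodeIndex(SlabIndex(slabidx), (i, j)))
end

function byslab(f, space; backend = CPU(), Nq = 4)
    n = nslabs(space)  # hypothetical helper: number of slabs in the space
    kernel = byslab_kernel!(backend, (Nq, Nq))
    kernel(f; ndrange = (Nq, Nq * n))
    KernelAbstractions.synchronize(backend)
end
```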
We should be able to do this transformation by defining a `SlabNodeIndex` type:

```julia
struct SlabNodeIndex
    slabidx::SlabIndex
    nodeidx::Tuple{Int, Int}
end
```
and defining rules such that:

- `getindex(::Field, ::SlabNodeIndex)` will give a "slab-node index view"
- `broadcasted` will propagate this view
- `Operators.allocate_work` will allocate the work arrays
- `Operators.apply_slab` will:
  - copy the node to the input work array
  - synchronize
  - compute spectral derivatives into the output work array
- `materialize!` will copy the broadcasted result into the output array.
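A minimal sketch of the first rule (the `SlabNodeView` type and the placeholder `Field` are illustrative assumptions, not the proposed implementation):

```julia
# Illustrative placeholder for a ClimaCore field.
struct Field{A}
    data::A
end

# Hypothetical lazy view of a field at a single slab-node index; the broadcast
# rules would propagate this view rather than eagerly indexing the data.
struct SlabNodeView{F}
    field::F
    idx::SlabNodeIndex
end

Base.getindex(f::Field, idx::SlabNodeIndex) = SlabNodeView(f, idx)
```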
Results and deliverables
A description of the key results, deliverables, quality expectations, and performance metrics.
- Initially, we will use a single thread per column (looping over levels within the kernel). This is the simplest approach (no synchronization is required) and allows dependencies between levels; see the sketch after this list.
- Explore alternatives if necessary (e.g. a thread per node).
- Can we apply operators by element, or even by a group of elements, instead of by slab? Using one block per slab would give only about 16 to 25 threads per block (our typical use case is polynomial degree 3 to 4, i.e. 4×4 or 5×5 nodes per slab). The typical recommendation is a minimum of 128 threads per block. A count of 16 would be only half a warp (NVIDIA) or a quarter of a wavefront (AMD)! Using at least a whole element per block would reach this target with about 8 levels (e.g. 4 × 4 nodes × 8 levels = 128 threads).
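A sketch of the thread-per-column approach (illustrative; the flat `[column, level]` indexing and the `Nlev` argument are assumptions):

```julia
using KernelAbstractions

# Illustrative: one thread per column, with a serial loop over levels inside
# the kernel, so each level may depend on results from the level below it.
@kernel function column_kernel!(y, f, x, Nlev)
    colidx = @index(Global, Linear)
    for level = 1:Nlev
        y[colidx, level] = f(x[colidx, level])
    end
end
```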
Task breakdown
Phase 1: shallow water equations
- `Topology2D` and `DistributedTopology2D`: remove `Topology2D`, rename `DistributedTopology2D` to `Topology2D` (reduce code duplication) #1057
- `CUDACommsContext` which wraps a `CuContext`: any topologies, spaces, etc. created with this will use CuArrays instead of Arrays #1085
- Make `Topology2D` parametric #1074
- `DataLayouts` #1075
- `SpectralElementSpace2D` #1076
- `Field`s: should be functional after spaces and fields #1077

Timeline: end of January 2023
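A hedged sketch of the `CUDACommsContext` idea from the list above (the struct layout and the `array_type` helper are assumptions for illustration only):

```julia
using CUDA

# Illustrative: a communications context that carries a CuContext; topologies,
# spaces, etc. constructed with it would allocate CuArrays rather than Arrays.
struct CUDACommsContext
    ctx::CUDA.CuContext
end

# Hypothetical helper: the array type for data created under this context.
array_type(::CUDACommsContext) = CUDA.CuArray
```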
Phase 2: a 3D model
Timeline: aim for end of Feb 2023
Phase 3: Multi-GPU and other features
Phase 4: Fix GPU-incompatible physics implementations in ClimaAtmos
Timeline: aim for end of May 2023
Reviewers
The names of CliMA personnel who will review the Proposal and subsequent PRs.