Purpose
The purpose of the Proposal and how it accomplishes a team/CliMA objective.
Link to any relevant PRs/issues.
Add GPU support to ClimaCore

Cost/benefits/risks
An analysis of the cost/benefits/risks associated with the proposal.
The key benefit of this proposal is that it would allow us to run code on GPUs, hopefully with minimal changes required to user code.
Producers
A list of the resources and named personnel required to implement the Proposal.
Components
A description of the main components of the software solution.
Inputs
A description of the inputs to the solution (designs, references, equations, closures, discussions etc).
The core idea is to build a mechanism for describing GPU kernels at a high level, which can then be used to generate efficient GPU code.
Horizontal kernel
Horizontal kernels are those which include horizontal (spectral element) operations. Internally, we will aim to generate kernels using a similar model to ClimateMachine.jl, which has proven to be very performant.
Threading will use a block for each slab and a thread for each node inside the slab. As a result, simple broadcasting over a field, e.g.

```julia
y .= f.(x)
```

would consist of a simple kernel, written in KernelAbstractions.jl (KA from here on) notation.
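As a rough illustration, such a kernel could look like the following (a minimal sketch, not the actual implementation; the kernel name and the `[slabidx][i, j]` field indexing are assumptions):

```julia
using KernelAbstractions

# Illustrative sketch: pointwise broadcast with one workgroup (block) per slab
# and one thread per node within the slab.
@kernel function broadcast_kernel!(y, f, x)
    slabidx = @index(Group, Linear)  # which slab this block handles
    i, j = @index(Local, NTuple)     # which node this thread handles
    y[slabidx][i, j] = f(x[slabidx][i, j])
end
```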
Operators (gradient, divergence, curl, interpolation, restriction) will consist of the following operations:

1. allocate an input work array shared across the block (`@localmem` in KA, or shared memory in CUDA) and an output array that is local to the thread (`@private` in KA, `MArray` in CUDA)
2. copy the input to the work array
3. synchronization of threads
4. computation of the spectral derivatives from the scratch array
5. write to the output array
For example, a gradient operation

```julia
grad = Operators.Gradient()
y .= grad.(f.(x))
```

would correspond to a kernel such as:
```julia
slabidx = @index(Group, Linear)
i, j = @index(Local, NTuple)

# 1) allocate
work_in = @localmem FT (Nq, Nq)
work_out = @private FT (Nq, Nq, 2)

# 2) copy to input work array, concretizing any intermediate steps
work_in[i, j] = f(x[slabidx][i, j])

# 3) synchronize threads
@synchronize()

# 4) spectral derivatives
work_out[i, j, 1] = work_out[i, j, 2] = 0.0
for k = 1:Nq
    work_out[i, j, 1] += D[i, k] * work_in[k, j]
    work_out[i, j, 2] += D[j, k] * work_in[i, k]
end

# 5) write to output array
y[slabidx][i, j] = Covariant12Vector(work_out[i, j, 1], work_out[i, j, 2])
```
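(In this sketch, `FT` is the floating-point element type, `Nq` the number of quadrature nodes per direction, and `D` the `Nq × Nq` spectral differentiation matrix; `work_in` is shared across the block, while `work_out` is private to each thread.)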
Interface
We will continue to make use of the broadcasting syntax for high-level specification, and combine that with the `byslab` function. Consider the following expression:

```julia
byslab(space) do idx
    grad = Operators.Gradient()
    y[idx] .= grad.(f.(x[idx]))
    u[idx] .= g.(y[idx])
end
```
This is really just a different syntax for the following:
```julia
function k(idx)
    grad = Operators.Gradient()
    y[idx] .= grad.(f.(x[idx]))
    u[idx] .= g.(y[idx])
end
byslab(k, space)
```
The basic idea is to transform `k` into a kernel function, which will then be launched by `byslab`.
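A hedged sketch of what the launch side could look like, assuming a KernelAbstractions backend (`byslab_kernel!`, `nslabs`, and the `SlabIndex(slabidx)` constructor are hypothetical names, and the workgroup shape is illustrative):

```julia
using KernelAbstractions

# Illustrative: one workgroup per slab, one thread per node; each thread calls
# the user function with its own slab-node index.
@kernel function byslab_kernel!(f)
    slabidx = @index(Group, Linear)
    i, j = @index(Local, NTuple)
    f(SlabNodeIndex(SlabIndex(slabidx), (i, j)))
end

function byslab(f, space; backend = CPU(), Nq = 4)
    n = nslabs(space)  # hypothetical helper: number of slabs in the space
    kernel = byslab_kernel!(backend, (Nq, Nq))
    kernel(f; ndrange = (Nq, Nq * n))
    KernelAbstractions.synchronize(backend)
end
```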
We should be able to do this transformation by defining a `SlabNodeIndex` type:

```julia
struct SlabNodeIndex
    slabidx::SlabIndex
    nodeidx::Tuple{Int, Int}
end
```
and defining rules such that:

- `getindex(::Field, ::SlabNodeIndex)` will give a "slab-node index view"
- `broadcasted` will propagate this view
- `Operators.allocate_work` will allocate the work arrays
- `Operators.apply_slab` will:
  - copy the node to the input work array
  - synchronize
  - compute spectral derivatives into the output work array
- `materialize!` will copy the broadcasted result into the output array.
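A minimal sketch of the first rule (the `SlabNodeView` type and the placeholder `Field` are illustrative assumptions, not the proposed implementation):

```julia
# Illustrative placeholder for a ClimaCore field.
struct Field{A}
    data::A
end

# Hypothetical lazy view of a field at a single slab-node index; the broadcast
# rules would propagate this view rather than eagerly indexing the data.
struct SlabNodeView{F}
    field::F
    idx::SlabNodeIndex
end

Base.getindex(f::Field, idx::SlabNodeIndex) = SlabNodeView(f, idx)
```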
Results and deliverables
A description of the key results, deliverables, quality expectations, and performance metrics.
- Initially, we will use a single thread per column (looping over levels within the kernel). This is the simplest approach (no synchronization is required) and allows dependencies between levels; see the sketch after this list.
- Explore alternatives if necessary (e.g. a thread per node).
- Can we apply operators by element, or even by a group of elements, instead of by slab? Using one block per slab would give only about 16 to 25 threads per block (our typical use case is polynomial degree 3 to 4, i.e. 4×4 or 5×5 nodes per slab). The typical recommendation is a minimum of 128 threads per block. A count of 16 would be only half a warp (NVIDIA) or a quarter of a wavefront (AMD)! Using at least a whole element per block would reach this target with about 8 levels (e.g. 4 × 4 nodes × 8 levels = 128 threads).
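A sketch of the thread-per-column approach (illustrative; the flat `[column, level]` indexing and the `Nlev` argument are assumptions):

```julia
using KernelAbstractions

# Illustrative: one thread per column, with a serial loop over levels inside
# the kernel, so each level may depend on results from the level below it.
@kernel function column_kernel!(y, f, x, Nlev)
    colidx = @index(Global, Linear)
    for level = 1:Nlev
        y[colidx, level] = f(x[colidx, level])
    end
end
```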
Task breakdown
Phase 1: shallow water equations
- `Topology2D` and `DistributedTopology2D`: remove `Topology2D`, rename `DistributedTopology2D` to `Topology2D` (reduce code duplication) #1057
- `CUDACommsContext` which wraps a `CuContext`: any topologies, spaces, etc. created with this will use CuArrays instead of Arrays #1085
- Make `Topology2D` parametric #1074
- `DataLayouts` #1075
- `SpectralElementSpace2D` #1076
- `Field`s: should be functional after spaces and fields #1077

Timeline: end of January 2023
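A hedged sketch of the `CUDACommsContext` idea from the list above (the struct layout and the `array_type` helper are assumptions for illustration only):

```julia
using CUDA

# Illustrative: a communications context that carries a CuContext; topologies,
# spaces, etc. constructed with it would allocate CuArrays rather than Arrays.
struct CUDACommsContext
    ctx::CUDA.CuContext
end

# Hypothetical helper: the array type for data created under this context.
array_type(::CUDACommsContext) = CUDA.CuArray
```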
Phase 2: a 3D model
Timeline: aim for end of Feb 2023
Phase 3: Multi-GPU and other features
Phase 4: Fix GPU-incompatible physics implementations in ClimaAtmos
Timeline: aim for end of May 2023
Reviewers
The names of CliMA personnel who will review the Proposal and subsequent PRs.