diff --git a/doc/rst/technotes/gpu.rst b/doc/rst/technotes/gpu.rst
index 68ed74d7dfb..272279c634c 100644
--- a/doc/rst/technotes/gpu.rst
+++ b/doc/rst/technotes/gpu.rst
@@ -5,8 +5,16 @@
GPU Programming
===============
-Chapel can be used to program GPUs. Currently NVIDIA and AMD GPUs are
-supported. Support for Intel GPUs is planned but not implemented, yet.
+Chapel enables developers to use parallelism at different levels: from
+intra-node multicore parallelism, to cross-node distributed parallelism, to
+GPUs. This technote serves as a reference on how to use Chapel to program GPUs.
+Specifically, it gives a quick overview of GPU programming in Chapel, includes a
+handful of examples, discusses system requirements and current limitations of
+GPU support, and describes some GPU-related features in more detail.
+
+Readers who prefer a more tutorial-like introduction to Chapel's GPU support
+may also wish to look at our `GPU Programming in Chapel
+`_ blog series.
.. warning::
@@ -18,27 +26,38 @@ supported. Support for Intel GPUs is planned but not implemented, yet.
Overview
--------
-The Chapel compiler will generate GPU kernels for certain ``forall`` and
-``foreach`` loops and launch these onto a GPU when the current locale (e.g.
-``here``) is assigned to a special (sub)locale representing a GPU. To deploy
-code to a GPU, put the relevant code in an ``on`` statement targeting a GPU
-sublocale (i.e. ``here.gpus[0]``).
+The Chapel compiler will generate GPU kernels for certain parallel operations
+such as ``forall``/``foreach`` loops, ``reduce`` expressions and promoted
+expressions. These will be launched onto a GPU when the current locale (i.e.
+``here``) is the sublocale representing that particular GPU. To deploy code to a
+GPU, put the relevant code in an ``on`` statement targeting a GPU sublocale
+(e.g. ``here.gpus[0]``).
Any arrays that are declared by tasks executing on a GPU sublocale will, by
default, be accessible on the GPU (see the `Memory Strategies`_ subsection for
more information about alternate memory strategies).
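+
+As a minimal sketch, assuming a build configured for GPU support (e.g. with
+``CHPL_LOCALE_MODEL=gpu``) and at least one GPU sublocale, the program below
+initializes an array on the first GPU sublocale with a ``forall`` loop and
+copies the result back into an array declared on the host. The names ``n``,
+``HostArr`` and ``DevArr`` are illustrative only.
+
+.. code-block:: chapel
+
+   config const n = 10;
+
+   // Declared by a task running on the host, before entering the GPU sublocale.
+   var HostArr: [1..n] int;
+
+   on here.gpus[0] {
+     // Declared by a task running on a GPU sublocale, so by default this
+     // array is accessible on the GPU.
+     var DevArr: [1..n] int;
+
+     // An order-independent loop over a range: eligible to become a kernel.
+     forall i in 1..n do
+       DevArr[i] = i * i;
+
+     // Copy the results back into the host array.
+     HostArr = DevArr;
+   }
+
+   writeln(HostArr);  // 1 4 9 16 25 36 49 64 81 100
+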
-Chapel will launch kernels for all eligible loops that are encountered by tasks
-executing on a GPU sublocale. Loops are eligible when:
+Chapel will launch kernels for all eligible data-parallel operations that are
+encountered by tasks executing on a GPU sublocale. Operations are eligible
+when:
+
+* They are order-independent, such as:
+
+ * `forall <../users-guide/datapar/forall.html>`_ or `foreach `_
+ loops over iterators that are also order-independent (i.e. iterators whose
+ yielding loops use ``foreach`` rather than ``for``; all of Chapel's iterators
+ over ranges, domains and arrays are order-independent),
+
+ * ``reduce`` expressions over order-independent iterators,
+
+ * Promoted expressions over order-independent iterators.
-* They are order-independent. i.e., `forall
- <../users-guide/datapar/forall.html>`_ or `foreach `_ loops over
- iterators that are also order-independent.
-* They only make use of known compiler primitives that are fast and local. Here
- "fast" means "safe to run in a signal handler" and "local" means "doesn't
- cause any network communication".
* They do not call out to ``extern`` functions (aside from those in an exempted
set of Chapel runtime functions).
+
+* They do not allocate memory dynamically (i.e. they create no class instances
+ or Chapel arrays).
+
* They are free of any call to a function that fails to meet the above
criteria or accesses outer variables.
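+
+The sketch below, again assuming a GPU-enabled build and using illustrative
+names, exercises each kind of eligible operation listed above. None of the loop
+bodies allocate memory or call ``extern`` functions; the arrays they use are
+declared beforehand, outside the loops.
+
+.. code-block:: chapel
+
+   config const n = 1000;
+   var total: int;
+
+   on here.gpus[0] {
+     var A, B: [1..n] int;
+
+     // 'forall' loop over a range (an order-independent iterator).
+     forall i in 1..n do
+       A[i] = i;
+
+     // 'foreach' loop over an array's domain.
+     foreach i in A.domain do
+       B[i] = A[i] + 1;
+
+     // Promoted expression: '*' is applied element-wise to 'B'.
+     A = B * 2;
+
+     // 'reduce' expression over an array.
+     total = + reduce A;
+   }
+
+   writeln(total);  // 1003000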
@@ -120,8 +139,9 @@ used with GPU support.
The following are further requirements for GPU support:
-* For targeting NVIDIA or AMD GPUs, ``LLVM`` must be used as Chapel's backend
- compiler (i.e. ``CHPL_LLVM`` must be set to ``system`` or ``bundled``).
+* For targeting NVIDIA or AMD GPUs, Chapel's default ``LLVM`` backend must be
+ used (i.e. ``CHPL_LLVM`` must be set to ``system`` or ``bundled``).
* Note that ``CHPL_TARGET_COMPILER`` must be ``llvm``. This is the default
when ``CHPL_LLVM`` is set to ``system`` or ``bundled``.
@@ -142,14 +162,14 @@ The following are further requirements for GPU support:
* Specifically for targeting AMD GPUs:
- * ROCm version between 4.x and 5.4 or between ROCm 6.0 and 6.2 must be installed.
+ * A ROCm version between 4.x and 5.4, or between 6.0 and 6.2, must be
+ installed.
 * For ROCm 5.x, ``CHPL_LLVM`` must be set to ``system``. Note that ROCm
installations come with LLVM. Setting ``CHPL_LLVM=system`` will allow you to
use that LLVM.
- * For ROCm 6.x, only LLVM 18+ is supported. Currently, only
- ``CHPL_LLVM=bundled`` is supported due to bugs in LLVM.
+ * For ROCm 6.x, only ``CHPL_LLVM=bundled`` is supported.
* Specifically for using the `CPU-as-Device mode`_:
@@ -427,16 +447,16 @@ For more examples see the tests under |multi_locale_dir|_ available from our
Reductions and Scans
~~~~~~~~~~~~~~~~~~~~
+``reduce`` expressions and ``reduce`` intents support ``+``, ``min`` and ``max``
+reductions. We are working towards expanding this support to other kinds of
+reductions and to ``scan`` expressions.
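+
+A minimal sketch of both forms follows, assuming a GPU-enabled build; the array
+contents and variable names are illustrative. ``min`` and ``max`` use ``reduce``
+expressions, while the sum uses a ``+ reduce`` intent on a ``forall`` loop.
+
+.. code-block:: chapel
+
+   config const n = 100;
+   var minVal, maxVal, total: int;
+
+   on here.gpus[0] {
+     var A: [1..n] int;
+     forall i in 1..n do
+       A[i] = n - i;  // holds the values n-1 down to 0
+
+     // 'reduce' expressions
+     minVal = min reduce A;
+     maxVal = max reduce A;
+
+     // 'reduce' intent: partial sums are combined into 'acc'
+     // when the loop completes.
+     var acc = 0;
+     forall a in A with (+ reduce acc) do
+       acc += a;
+     total = acc;
+   }
+
+   writeln(minVal, " ", maxVal, " ", total);  // 0 99 4950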
+
The :mod:`GPU` module has standalone functions for basic reductions (e.g.
:proc:`~GPU.gpuSumReduce`) and scans (e.g. :proc:`~GPU.gpuScan`). We expect
these functions to be deprecated in favor of ``reduce`` and ``scan`` expressions
in a future release.
-As of Chapel 2.1, ``+``, ``min`` and ``max`` reductions are supported via
-``reduce`` expressions and intents. We are working towards expanding this to
-other kinds of reductions and ``scan`` expressions and deprecating the mentioned
-functions in the :mod:`GPU` module.
-
Device-to-Device Communication Support
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Chapel supports direct communication between interconnected GPUs. The supported
@@ -616,11 +636,11 @@ Tested Configurations
---------------------
We have experience with the following hardware and software versions. The ones
-marked with * are covered in our nightly testing configuration.
+marked with * are covered in our nightly testing configurations.
* NVIDIA
- * Hardware: RTX A2000, P100*, V100*, A100* and H100
+ * Hardware: RTX A2000, P100*, V100*, A100*, H100, GH200
* Software: CUDA 11.3*, 11.6, 11.8*, 12.0, 12.2*, 12.4
@@ -628,7 +648,7 @@ marked with * are covered in our nightly testing configuration.
* Hardware: MI60*, MI100 and MI250X*
- * Software:ROCm 4.2*, 4.4, 5.4*, 6.0, 6.1, 6.2
+ * Software: ROCm 4.2*, 4.4, 5.4*, 6.0, 6.1, 6.2*
GPU Support on Windows Subsystem for Linux
@@ -650,6 +670,10 @@ for more information on using Chapel with WSL.
Further Information
-------------------
+* The `GPU Programming in Chapel series
+ `_ is a good
+ resource for getting started with GPU programming in Chapel.
+
* Please refer to issues with `GPU Support label
`_ for
other known limitations and issues.