GPU technote updates for 2.2 (#25960)
Some changes to call out here:

- Adopting a more confident tone: the Intel limitation is removed from the first
  paragraph and replaced with a link to the blog series
- Replacing the focus on `foreach`/`forall` with order-independent statements
  to cover promotion and reduction
- Highlighting `reduce` expressions more than the standalone functions
- Massaging the language around ROCm 6 and LLVM versions

[Reviewed by @jabraham17 and @stonea]
e-kayrakli committed Sep 18, 2024
2 parents 2dba35d + 05d2c29 commit 24a68ca
Showing 1 changed file with 52 additions and 28 deletions: doc/rst/technotes/gpu.rst
GPU Programming
===============

Chapel enables developers to use parallelism at different levels: from
intra-node multicore parallelism, to cross-node distributed parallelism, to
GPUs. This technote serves as a reference for programming GPUs with Chapel.
Specifically, it gives a quick overview of GPU programming, includes a handful
of examples, discusses system requirements and current limitations for GPU
support, and goes into more detail on some specific GPU-related features.

Readers preferring a more tutorial-like introduction to Chapel's GPU support
may also wish to look at our `GPU Programming in Chapel
<https://chapel-lang.org/blog/series/gpu-programming-in-chapel/>`_ blog series.

.. warning::

Overview
--------

The Chapel compiler will generate GPU kernels for certain parallel operations
such as ``forall``/``foreach`` loops, ``reduce`` expressions, and promoted
expressions. These will be launched onto a GPU when the current locale (e.g.
``here``) is the sublocale representing that particular GPU. To deploy code to
a GPU, put the relevant code in an ``on`` statement targeting a GPU sublocale
(e.g. ``here.gpus[0]``).

Any arrays that are declared by tasks executing on a GPU sublocale will, by
default, be accessible on the GPU (see the `Memory Strategies`_ subsection for
more information about alternate memory strategies).
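
For instance, here is a minimal sketch (the array size and the computation are
arbitrary, and it assumes the current locale has at least one GPU) that
declares a GPU-accessible array and runs an order-independent loop, a promoted
expression, and a ``reduce`` expression on the first GPU of the node:

.. code-block:: chapel

   // Run the enclosed code with the first GPU of this node as ``here``.
   on here.gpus[0] {
     // Declared by a task executing on a GPU sublocale, so the array is
     // GPU-accessible by default.
     var A: [1..1024] real;

     // An order-independent loop that the compiler can launch as a GPU kernel.
     forall i in A.domain do
       A[i] = i * 2.0;

     // Promoted and ``reduce`` expressions can launch kernels as well.
     A = A + 1.0;
     writeln(+ reduce A);
   }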

Chapel will launch kernels for all eligible data-parallel operations that are
encountered by tasks executing on a GPU sublocale. Expressions are eligible
when (see the sketch after this list):

* They are order-independent, such as:

  * `forall <../users-guide/datapar/forall.html>`_ or `foreach <foreach.html>`_
    loops over iterators that are also order-independent (i.e., the iterator's
    yielding loop uses ``foreach`` instead of ``for``; all Chapel iterators
    over ranges, domains, and arrays are order-independent),

  * ``reduce`` expressions over order-independent iterators,

  * promoted expressions over order-independent iterators.

* They only make use of known compiler primitives that are fast and local. Here
  "fast" means "safe to run in a signal handler" and "local" means "doesn't
  cause any network communication".

* They do not call out to ``extern`` functions (aside from those in an exempted
  set of Chapel runtime functions).

* They do not allocate memory dynamically (i.e. no class instances or Chapel
  arrays are created within).

* They are free of any call to a function that fails to meet the above
  criteria or accesses outer variables.
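
To illustrate these rules, here is a minimal sketch (the array names and sizes
are arbitrary): the first loop satisfies all of the criteria above and is
launched as a kernel, whereas the second is not eligible because its body
allocates a Chapel array:

.. code-block:: chapel

   on here.gpus[0] {
     var A, B: [1..1024] int;

     // Eligible: order-independent, no extern calls, no dynamic allocation.
     forall i in 1..1024 do
       A[i] = i*i;

     // Not eligible: the body creates a Chapel array, so no kernel is
     // generated and this loop runs on the host CPU instead.
     forall i in 1..1024 {
       var scratch: [1..4] int;
       scratch = i;              // promoted assignment
       B[i] = + reduce scratch;  // 4*i
     }
   }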


The following are further requirements for GPU support:

* For targeting NVIDIA or AMD GPUs, Chapel's default ``LLVM`` backend must be
  used (i.e. ``CHPL_LLVM`` must be set to ``system`` or ``bundled``).

  * Note that ``CHPL_TARGET_COMPILER`` must be ``llvm``. This is the default
    when ``CHPL_LLVM`` is set to ``system`` or ``bundled``.

* Specifically for targeting AMD GPUs:

  * A ROCm version between 4.x and 5.4, or between 6.0 and 6.2, must be
    installed.

  * For ROCm 5.x, ``CHPL_LLVM`` must be set to ``system``. Note that ROCm
    installations come with LLVM; setting ``CHPL_LLVM=system`` will allow you
    to use that LLVM.

  * For ROCm 6.x, only ``CHPL_LLVM=bundled`` is supported.

* Specifically for using the `CPU-as-Device mode`_:


Reductions and Scans
~~~~~~~~~~~~~~~~~~~~
``+``, ``min`` and ``max`` reductions are supported via ``reduce`` expressions
and intents. We are working towards expanding this to other kinds of reductions
and to ``scan`` expressions, and towards deprecating the corresponding
standalone functions in the :mod:`GPU` module (described below).

The :mod:`GPU` module has standalone functions for basic reductions (e.g.
:proc:`~GPU.gpuSumReduce`) and scans (e.g. :proc:`~GPU.gpuScan`). We expect
these functions to be deprecated in favor of ``reduce`` and ``scan`` expressions
in a future release.
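
For example, the following minimal sketch (the array contents are arbitrary)
computes the same sum with a ``reduce`` expression and with the standalone
:proc:`~GPU.gpuSumReduce` function:

.. code-block:: chapel

   use GPU;

   on here.gpus[0] {
     var A: [1..1024] int;
     A = 1;                           // promoted assignment: all elements are 1

     const total   = + reduce A;      // reduction via a ``reduce`` expression
     const largest = max reduce A;    // ``min``/``max`` work the same way

     const total2 = gpuSumReduce(A);  // same sum via the GPU module function

     writeln(total == total2);        // true
   }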


Device-to-Device Communication Support
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Chapel supports direct communication between interconnected GPUs. The supported
Tested Configurations
---------------------

We have experience with the following hardware and software versions. The ones
marked with * are covered in our nightly testing configurations.

* NVIDIA

  * Hardware: RTX A2000, P100*, V100*, A100*, H100, GH200

  * Software: CUDA 11.3*, 11.6, 11.8*, 12.0, 12.2*, 12.4

* AMD

  * Hardware: MI60*, MI100 and MI250X*

  * Software: ROCm 4.2*, 4.4, 5.4*, 6.0, 6.1, 6.2*


GPU Support on Windows Subsystem for Linux
------------------------------------------

Further Information
-------------------
* The `GPU Programming in Chapel series
  <https://chapel-lang.org/blog/series/gpu-programming-in-chapel/>`_ is a good
  resource for getting started with GPU programming in Chapel.

* Please refer to issues with the `GPU Support label
  <https://github.com/chapel-lang/chapel/labels/area%3A%20GPU%20Support>`_ for
  other known limitations and issues.
