From e10de135173c26ca89595e95b7ee08237b51af00 Mon Sep 17 00:00:00 2001
From: Engin Kayraklioglu <e-kayrakli@users.noreply.github.com>
Date: Wed, 18 Sep 2024 10:04:59 -0700
Subject: [PATCH 1/4] GPU technote updates for 2.2

Signed-off-by: Engin Kayraklioglu <e-kayrakli@users.noreply.github.com>
---
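The kernel-eligible operations listed in the updated Overview (order-independent ``forall`` loops, ``reduce`` expressions and intents, and promoted expressions) can be illustrated with a short Chapel sketch. It assumes a Chapel build with GPU support. ``sumOfSquares``, ``sumWithIntent`` and the config constant ``n`` are hypothetical names chosen for illustration. Whether each operation actually becomes a kernel depends on the compiler and configuration.

```chapel
// Sketch only: assumes a Chapel build with GPU support. On builds without
// GPU sublocales, here.gpus is expected to be empty, so the final loop is a no-op.
config const n = 10;

// Hypothetical helper: each marked operation is a kernel candidate when the
// helper runs on a GPU sublocale.
proc sumOfSquares(k: int): int {
  var A: [1..k] int;        // GPU-accessible by default when allocated on a GPU sublocale

  forall i in A.domain do   // order-independent loop
    A[i] = i;

  A = A * A;                // promoted expression: squares each element

  return + reduce A;        // '+' reduce expression
}

// Hypothetical helper showing a '+' reduce intent on a forall loop.
proc sumWithIntent(k: int): int {
  var total = 0;
  forall i in 1..k with (+ reduce total) do
    total += i * i;
  return total;
}

// Run both helpers once per GPU sublocale of the current locale.
for gpu in here.gpus do on gpu {
  writeln(here, ": ", sumOfSquares(n), " ", sumWithIntent(n));
}
```

Because the helpers contain no GPU-specific calls, they are also expected to work when called from the host. For ``n = 10`` each should return 385, the sum of the squares of 1 through 10.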
 doc/rst/technotes/gpu.rst | 72 ++++++++++++++++++++++++---------------
 1 file changed, 44 insertions(+), 28 deletions(-)

diff --git a/doc/rst/technotes/gpu.rst b/doc/rst/technotes/gpu.rst
index 68ed74d7dfb..f9193212a20 100644
--- a/doc/rst/technotes/gpu.rst
+++ b/doc/rst/technotes/gpu.rst
@@ -5,8 +5,10 @@
 GPU Programming
 ===============
 
-Chapel can be used to program GPUs. Currently  NVIDIA and AMD GPUs are
-supported. Support for Intel GPUs is planned but not implemented, yet.
+Chapel can be used to program GPUs. The `GPU Programming in Chapel series
+<https://chapel-lang.org/blog/series/gpu-programming-in-chapel/>`_ is a good
+resource for getting started with GPU programming in Chapel. This technote has
+some examples, but it is closer to a reference manual than a tutorial.
 
 .. warning::
 
@@ -18,27 +20,38 @@ supported. Support for Intel GPUs is planned but not implemented, yet.
 Overview
 --------
 
-The Chapel compiler will generate GPU kernels for certain ``forall`` and
-``foreach`` loops and launch these onto a GPU when the current locale (e.g.
-``here``) is assigned to a special (sub)locale representing a GPU. To deploy
-code to a GPU, put the relevant code in an ``on`` statement targeting a GPU
-sublocale (i.e. ``here.gpus[0]``).
+The Chapel compiler will generate GPU kernels for certain parallel operations
+such as ``forall``/``foreach`` loops, ``reduce`` expressions and promoted
+expressions. These kernels are launched onto a GPU when the current locale (i.e.
+``here``) is the sublocale representing that particular GPU. To deploy code to a
+GPU, put the relevant code in an ``on`` statement targeting a GPU sublocale
+(e.g. ``here.gpus[0]``).
 
 Any arrays that are declared by tasks executing on a GPU sublocale will, by
 default, be accessible on the GPU (see the `Memory Strategies`_ subsection for
 more information about alternate memory strategies).
 
-Chapel will launch kernels for all eligible loops that are encountered by tasks
-executing on a GPU sublocale.  Loops are eligible when:
+Chapel will launch kernels for all eligible data-parallel operations that are
+encountered by tasks executing on a GPU sublocale. Operations are eligible
+when:
+
+* They are order-independent, such as:
+
+  * `forall <../users-guide/datapar/forall.html>`_ or `foreach <foreach.html>`_
+    loops over iterators that are also order-independent (i.e. iterators whose
+    yielding loops use ``foreach`` rather than ``for``; all Chapel iterators over
+    ranges, domains and arrays are order-independent),
+
+  * ``reduce`` expressions over order-independent iterators,
+
+  * Promoted expressions over order-independent iterators.
 
-* They are order-independent. i.e., `forall
-  <../users-guide/datapar/forall.html>`_ or `foreach <foreach.html>`_ loops over
-  iterators that are also order-independent.
-* They only make use of known compiler primitives that are fast and local. Here
-  "fast" means "safe to run in a signal handler" and "local" means "doesn't
-  cause any network communication".
 * They do not call out to ``extern`` functions (aside from those in an exempted
   set of Chapel runtime functions).
+
+* They do not allocate memory dynamically (i.e. they create no class instances
+  or Chapel arrays).
+
 * They are free of any call to a function that fails to meet the above
   criteria or accesses outer variables.
 
@@ -120,8 +133,9 @@ used with GPU support.
 
 The following are further requirements for GPU support:
 
-* For targeting NVIDIA or AMD GPUs, ``LLVM`` must be used as Chapel's backend
-  compiler (i.e.  ``CHPL_LLVM`` must be set to ``system`` or ``bundled``).
+* For targeting NVIDIA or AMD GPUs, Chapel's default ``LLVM`` backend must be
+  used (i.e. ``CHPL_LLVM`` must be set to ``system`` or
+  ``bundled``).
 
   * Note that ``CHPL_TARGET_COMPILER`` must be ``llvm``. This is the default
     when ``CHPL_LLVM`` is set to ``system`` or ``bundled``.
@@ -142,14 +156,16 @@ The following are further requirements for GPU support:
 
 * Specifically for targeting AMD GPUs:
 
-  * ROCm version between 4.x and 5.4 or between ROCm 6.0 and 6.2 must be installed.
+  * ROCm version between 4.x and 5.4 or between ROCm 6.0 and 6.2 must be
+    installed.
 
   * For ROCm 5.x, ``CHPL_LLVM`` must be set to ``system``. Note that, ROCm
     installations come with LLVM. Setting ``CHPL_LLVM=system`` will allow you to
     use that LLVM.
 
-  * For ROCm 6.x, only LLVM 18+ is supported. Currently, only
-    ``CHPL_LLVM=bundled`` is supported due to bugs in LLVM. 
+  * For ROCm 6.x, only ``CHPL_LLVM=bundled`` is supported. The bundled LLVM is
+    version 18 with a patch to support ROCm 6. Said patch is in LLVM 19, as such
+    we expect to support system LLVM 19+ with ROCm 6 in the upcoming releases.
 
 * Specifically for using the `CPU-as-Device mode`_:
 
@@ -427,16 +443,16 @@ For more examples see the tests under |multi_locale_dir|_ available from our
 
 Reductions and Scans
 ~~~~~~~~~~~~~~~~~~~~
+``+``, ``min`` and ``max`` reductions are supported via ``reduce`` expressions
+and ``reduce`` intents, both of which can be executed on GPUs. We are working
+towards supporting other kinds of reductions, as well as ``scan`` expressions,
+in the same way.
+
 The :mod:`GPU` module has standalone functions for basic reductions (e.g.
 :proc:`~GPU.gpuSumReduce`) and scans (e.g.  :proc:`~GPU.gpuScan`). We expect
 these functions to be deprecated in favor of ``reduce`` and ``scan`` expressions
 in a future release.
 
-As of Chapel 2.1, ``+``, ``min`` and ``max`` reductions are supported via
-``reduce`` expressions and intents. We are working towards expanding this to
-other kinds of reductions and ``scan`` expressions and deprecating the mentioned
-functions in the :mod:`GPU` module.
-
 Device-to-Device Communication Support
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 Chapel supports direct communication between interconnected GPUs. The supported
@@ -616,11 +632,11 @@ Tested Configurations
 ---------------------
 
 We have experience with the following hardware and software versions. The ones
-marked with * are covered in our nightly testing configuration.
+marked with * are covered in our nightly testing configurations.
 
 * NVIDIA
 
-  * Hardware: RTX A2000, P100*, V100*, A100* and H100
+  * Hardware: RTX A2000, P100*, V100*, A100*, H100, GH200
 
   * Software: CUDA 11.3*, 11.6, 11.8*, 12.0, 12.2*, 12.4
 
@@ -628,7 +644,7 @@ marked with * are covered in our nightly testing configuration.
 
   * Hardware: MI60*, MI100 and MI250X*
 
-  * Software:ROCm 4.2*, 4.4, 5.4*, 6.0, 6.1, 6.2
+  * Software: ROCm 4.2*, 4.4, 5.4*, 6.0, 6.1, 6.2*
 
 
 GPU Support on Windows Subsystem for Linux

From 10f552a035abe84ba1510b4f1b0b715b0a177de5 Mon Sep 17 00:00:00 2001
From: Engin Kayraklioglu <e-kayrakli@users.noreply.github.com>
Date: Wed, 18 Sep 2024 10:11:07 -0700
Subject: [PATCH 2/4] Add the blog series at the end as well

Signed-off-by: Engin Kayraklioglu <e-kayrakli@users.noreply.github.com>
---
 doc/rst/technotes/gpu.rst | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/doc/rst/technotes/gpu.rst b/doc/rst/technotes/gpu.rst
index f9193212a20..4cd1778e455 100644
--- a/doc/rst/technotes/gpu.rst
+++ b/doc/rst/technotes/gpu.rst
@@ -666,6 +666,10 @@ for more information on using Chapel with WSL.
 
 Further Information
 -------------------
+* The `GPU Programming in Chapel series
+  <https://chapel-lang.org/blog/series/gpu-programming-in-chapel/>`_ is a good
+  resource for getting started with GPU programming in Chapel.
+
 * Please refer to issues with `GPU Support label
   <https://github.com/chapel-lang/chapel/labels/area%3A%20GPU%20Support>`_ for
   other known limitations and issues.

From 1e8d36657d9232a06b0a3ba96d0844b07a86568a Mon Sep 17 00:00:00 2001
From: Engin Kayraklioglu <e-kayrakli@users.noreply.github.com>
Date: Wed, 18 Sep 2024 10:51:26 -0700
Subject: [PATCH 3/4] Drop details about ROCm 6

Signed-off-by: Engin Kayraklioglu <e-kayrakli@users.noreply.github.com>
---
 doc/rst/technotes/gpu.rst | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/doc/rst/technotes/gpu.rst b/doc/rst/technotes/gpu.rst
index 4cd1778e455..c8c013ae425 100644
--- a/doc/rst/technotes/gpu.rst
+++ b/doc/rst/technotes/gpu.rst
@@ -163,9 +163,7 @@ The following are further requirements for GPU support:
     installations come with LLVM. Setting ``CHPL_LLVM=system`` will allow you to
     use that LLVM.
 
-  * For ROCm 6.x, only ``CHPL_LLVM=bundled`` is supported. The bundled LLVM is
-    version 18 with a patch to support ROCm 6. Said patch is in LLVM 19, as such
-    we expect to support system LLVM 19+ with ROCm 6 in the upcoming releases.
+  * For ROCm 6.x, only ``CHPL_LLVM=bundled`` is supported.
 
 * Specifically for using the `CPU-as-Device mode`_:
 

From 05d2c29570ea50bda8ed8b2977ad8d6630bffea3 Mon Sep 17 00:00:00 2001
From: Engin Kayraklioglu <e-kayrakli@users.noreply.github.com>
Date: Wed, 18 Sep 2024 12:16:56 -0700
Subject: [PATCH 4/4] Take Andy's suggestion for the intro

Signed-off-by: Engin Kayraklioglu <e-kayrakli@users.noreply.github.com>
---
 doc/rst/technotes/gpu.rst | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/doc/rst/technotes/gpu.rst b/doc/rst/technotes/gpu.rst
index c8c013ae425..272279c634c 100644
--- a/doc/rst/technotes/gpu.rst
+++ b/doc/rst/technotes/gpu.rst
@@ -5,10 +5,16 @@
 GPU Programming
 ===============
 
-Chapel can be used to program GPUs. The `GPU Programming in Chapel series
-<https://chapel-lang.org/blog/series/gpu-programming-in-chapel/>`_ is a good
-resource for getting started with GPU programming in Chapel. This technote has
-some examples, but it is closer to a reference manual than a tutorial.
+Chapel enables developers to use parallelism at different levels: from
+intra-node multicore parallelism, to cross-node distributed parallelism, to
+GPUs. This technote serves as a reference on how to use Chapel to program GPUs.
+Specifically, it gives a quick overview of GPU programming, includes a handful
+of examples, discusses system requirements and current limitations for GPU
+support, and delves into more detail on specific GPU-related features.
+
+Readers preferring a more tutorial-like introduction to Chapel's GPU support
+may also wish to look at our `GPU Programming in Chapel
+<https://chapel-lang.org/blog/series/gpu-programming-in-chapel/>`_ blog series.
 
 .. warning::