Commit a414cb2

Docs: Expand HIP porting guide and CUDA driver porting guide
1 parent 742be96 commit a414cb2

3 files changed, +133 -64 lines changed

docs/how-to/hip_cpp_language_extensions.rst

Lines changed: 0 additions & 37 deletions

@@ -250,43 +250,6 @@ Units, also known as SIMDs, each with their own register file. For more
 information see :doc:`../understand/hardware_implementation`.
 :cpp:struct:`hipDeviceProp_t` also has a field ``executionUnitsPerMultiprocessor``.
 
-Porting from CUDA __launch_bounds__
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-CUDA also defines a ``__launch_bounds__`` qualifier which works similar to HIP's
-implementation, however it uses different parameters:
-
-.. code-block:: cpp
-
-  __launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MULTIPROCESSOR)
-
-The first parameter is the same as HIP's implementation, but
-``MIN_BLOCKS_PER_MULTIPROCESSOR`` must be converted to
-``MIN_WARPS_PER_EXECUTION``, which uses warps and execution units rather than
-blocks and multiprocessors. This conversion is performed automatically by
-:doc:`HIPIFY <hipify:index>`, or can be done manually with the following
-equation.
-
-.. code-block:: cpp
-
-  MIN_WARPS_PER_EXECUTION_UNIT = (MIN_BLOCKS_PER_MULTIPROCESSOR * MAX_THREADS_PER_BLOCK) / warpSize
-
-Directly controlling the warps per execution unit makes it easier to reason
-about the occupancy, unlike with blocks, where the occupancy depends on the
-block size.
-
-The use of execution units rather than multiprocessors also provides support for
-architectures with multiple execution units per multiprocessor. For example, the
-AMD GCN architecture has 4 execution units per multiprocessor.
-
-maxregcount
-""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
-
-Unlike ``nvcc``, ``amdclang++`` does not support the ``--maxregcount`` option.
-Instead, users are encouraged to use the ``__launch_bounds__`` directive since
-the parameters are more intuitive and portable than micro-architecture details
-like registers. The directive allows per-kernel control.
-
 Memory space qualifiers
 ================================================================================

docs/how-to/hip_porting_driver_api.rst

Lines changed: 77 additions & 16 deletions

@@ -1,33 +1,30 @@
 .. meta::
   :description: This chapter presents how to port the CUDA driver API and showcases equivalent operations in HIP.
-  :keywords: AMD, ROCm, HIP, CUDA, driver API
+  :keywords: AMD, ROCm, HIP, CUDA, driver API, porting, port
 
 .. _porting_driver_api:
 
 *******************************************************************************
 Porting CUDA driver API
 *******************************************************************************
 
-NVIDIA provides separate CUDA driver and runtime APIs. The two APIs have
-significant overlap in functionality:
-
-* Both APIs support events, streams, memory management, memory copy, and error
-  handling.
-
-* Both APIs deliver similar performance.
+CUDA provides separate driver and runtime APIs. The two APIs generally provide
+the same functionality; however, the driver API allows for more fine-grained
+control over initialization, context management, and module management, all of
+which the runtime API takes care of implicitly.
 
 * Driver API calls begin with the prefix ``cu``, while runtime API calls begin
   with the prefix ``cuda``. For example, the driver API contains
   ``cuEventCreate``, while the runtime API contains ``cudaEventCreate``, which
   has similar functionality.
 
-* The driver API defines a different, but largely overlapping, error code space
-  than the runtime API and uses a different coding convention. For example, the
-  driver API defines ``CUDA_ERROR_INVALID_VALUE``, while the runtime API defines
-  ``cudaErrorInvalidValue``.
+* The driver API offers two additional functionalities not directly provided by
+  the runtime API: the ``cuModule`` and ``cuCtx`` APIs.
 
-The driver API offers two additional functionalities not provided by the runtime
-API: ``cuModule`` and ``cuCtx`` APIs.
+HIP does not explicitly provide two different APIs; the functions corresponding
+to the CUDA driver API are available in the HIP runtime API and are usually
+prefixed with ``hipDrv``. The module and context functionality is available with
+the ``hipModule`` and ``hipCtx`` prefixes.
 
 cuModule API
 ================================================================================
@@ -123,8 +120,8 @@ HIPIFY translation of CUDA driver API
 The HIPIFY tools convert CUDA driver APIs for streams, events, modules, devices, memory management, context, and the profiler to the equivalent HIP calls. For example, ``cuEventCreate`` is translated to ``hipEventCreate``.
 HIPIFY tools also convert error codes from the driver namespace and coding conventions to the equivalent HIP error code. HIP unifies the APIs for these common functions.
 
-The memory copy API requires additional explanation. The CUDA driver includes the memory direction in the name of the API (``cuMemcpyH2D``), while the CUDA driver API provides a single memory copy API with a parameter that specifies the direction. It also supports a "default" direction where the runtime determines the direction automatically.
-HIP provides APIs with both styles, for example, ``hipMemcpyH2D`` as well as ``hipMemcpy``.
+The memory copy API requires additional explanation. The CUDA driver API includes the memory direction in the name of the API (``cuMemcpyHtoD``), while the CUDA runtime API provides a single memory copy API with a parameter that specifies the direction. It also supports a "default" direction where the runtime determines the direction automatically.
+HIP provides both versions, for example, ``hipMemcpyHtoD`` as well as ``hipMemcpy``.
 The first version might be faster in some cases because it avoids any host overhead to detect the different memory directions.
 
 HIP defines a single error space and uses camel case for all errors (i.e. ``hipErrorInvalidValue``).
@@ -547,3 +544,67 @@ The HIP version number is defined as an integer:
 .. code-block:: cpp
 
   HIP_VERSION=HIP_VERSION_MAJOR * 10000000 + HIP_VERSION_MINOR * 100000 + HIP_VERSION_PATCH
+
+********************************************************************************
+CU_POINTER_ATTRIBUTE_MEMORY_TYPE
+********************************************************************************
+
+To get a pointer's memory type in HIP, use :cpp:func:`hipPointerGetAttributes`.
+The first parameter of the function is a pointer to a
+:cpp:struct:`hipPointerAttribute_t` structure; its ``type`` member indicates
+whether the memory pointed to is allocated on the device or the host.
+
+For example:
+
+.. code-block:: cpp
+
+  double* ptr;
+  hipMalloc(&ptr, sizeof(double));
+  hipPointerAttribute_t attr;
+  hipPointerGetAttributes(&attr, ptr); // attr.type is hipMemoryTypeDevice
+  if(attr.type == hipMemoryTypeDevice)
+    std::cout << "ptr is of type hipMemoryTypeDevice" << std::endl;
+
+  double* ptrHost;
+  hipHostMalloc(&ptrHost, sizeof(double));
+  hipPointerAttribute_t attrHost;
+  hipPointerGetAttributes(&attrHost, ptrHost); // attrHost.type is hipMemoryTypeHost
+  if(attrHost.type == hipMemoryTypeHost)
+    std::cout << "ptrHost is of type hipMemoryTypeHost" << std::endl;
+
+Note that the ``hipMemoryType`` enum values are different from the
+``cudaMemoryType`` enum values.
+
+For example, on the AMD platform, ``hipMemoryType`` is defined in
+``hip_runtime_api.h``:
+
+.. code-block:: cpp
+
+  typedef enum hipMemoryType {
+    hipMemoryTypeHost = 0,    ///< Memory is physically located on host
+    hipMemoryTypeDevice = 1,  ///< Memory is physically located on device (see deviceId for specific device)
+    hipMemoryTypeArray = 2,   ///< Array memory, physically located on device (see deviceId for specific device)
+    hipMemoryTypeUnified = 3, ///< Not used currently
+    hipMemoryTypeManaged = 4  ///< Managed memory, automatically managed by the unified memory system
+  } hipMemoryType;
+
+The CUDA toolkit defines ``cudaMemoryType`` as follows:
+
+.. code-block:: cpp
+
+  enum cudaMemoryType
+  {
+    cudaMemoryTypeUnregistered = 0, // Unregistered memory.
+    cudaMemoryTypeHost = 1,         // Host memory.
+    cudaMemoryTypeDevice = 2,       // Device memory.
+    cudaMemoryTypeManaged = 3       // Managed memory.
+  };
+
+The memory type translation for ``hipPointerGetAttributes`` therefore needs to
+be handled properly on the NVIDIA platform to return the correct memory type in
+CUDA; this is done in the file ``nvidia_hip_runtime_api.h``.
+
+Consequently, in any HIP application that works with memory types, developers
+should use ``#ifdef`` guards to assign the correct enum values depending on
+whether the NVIDIA or AMD platform is targeted.
+
+As an example, see the code in `hipMemcpyParam2D.cc <https://github.com/ROCm/hip-tests/tree/develop/catch/unit/memory/hipMemcpyParam2D.cc>`_.
+With the ``#ifdef`` condition, HIP APIs work as expected on both AMD and NVIDIA
+platforms.
+
+Note that ``cudaMemoryTypeUnregistered`` is currently not represented in the
+``hipMemoryType`` enum, to preserve backward compatibility of the HIP API.

docs/how-to/hip_porting_guide.rst

Lines changed: 56 additions & 11 deletions

@@ -14,10 +14,21 @@ suggestions on how to port CUDA code and work through common issues.
 Porting a CUDA Project
 ********************************************************************************
 
+Mixing HIP and CUDA code results in valid CUDA code. This enables users to
+incrementally port CUDA to HIP and still compile and test the code during the
+transition.
+
+The only notable exception is ``hipError_t``, which is not just an alias of
+``cudaError_t``. In these cases HIP provides functions to convert between the
+error code spaces:
+
+* :cpp:func:`hipErrorToCudaError`
+* :cpp:func:`hipCUDAErrorTohipError`
+* :cpp:func:`hipCUResultTohipError`
+
 General Tips
 ================================================================================
 
-* You can incrementally port pieces of the code to HIP while leaving the rest in CUDA. HIP is just a thin layer over CUDA, so the two languages can interoperate.
 * Starting to port on an NVIDIA machine is often the easiest approach, as the code can be tested for functionality and performance even if not fully ported to HIP.
 * Once the CUDA code is ported to HIP and is running on the CUDA machine, compile the HIP code for an AMD machine.
 * You can handle platform-specific features through conditional compilation or by adding them to the open-source HIP infrastructure.
@@ -533,16 +544,6 @@ supports, together with the corresponding macros and device properties.
   - ``hasDynamicParallelism``
   - Ability to launch a kernel from within a kernel
 
-********************************************************************************
-Finding HIP
-********************************************************************************
-
-Makefiles can use the following syntax to conditionally provide a default HIP_PATH if one does not exist:
-
-.. code-block:: shell
-
-  HIP_PATH ?= $(shell hipconfig --path)
-
 ********************************************************************************
 Compilation
 ********************************************************************************
@@ -555,6 +556,12 @@ options are appropriate for the target compiler.
 ``hipconfig`` is a helpful tool in identifying the current systems platform,
 compiler and runtime. It can also help set options appropriately.
 
+For example, it can provide the path to the HIP installation for use in
+Makefiles:
+
+.. code-block:: shell
+
+  HIP_PATH ?= $(shell hipconfig --path)
+
 HIP Headers
 ================================================================================
 
@@ -602,3 +609,41 @@ platforms and architectures. The ``warpSize`` built-in should be used in device
 code, while the host can query it during runtime via the device properties. See
 the :ref:`HIP language extension for warpSize <warp_size>` for information on
 how to write portable wave-aware code.
+
+********************************************************************************
+Porting from CUDA __launch_bounds__
+********************************************************************************
+
+CUDA also defines a ``__launch_bounds__`` qualifier, which works similarly to
+HIP's implementation but uses different parameters:
+
+.. code-block:: cpp
+
+  __launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MULTIPROCESSOR)
+
+The first parameter is the same as in HIP's implementation, but
+``MIN_BLOCKS_PER_MULTIPROCESSOR`` must be converted to
+``MIN_WARPS_PER_EXECUTION_UNIT``, which uses warps and execution units rather
+than blocks and multiprocessors. This conversion is performed automatically by
+:doc:`HIPIFY <hipify:index>`, or can be done manually with the following
+equation:
+
+.. code-block:: cpp
+
+  MIN_WARPS_PER_EXECUTION_UNIT = (MIN_BLOCKS_PER_MULTIPROCESSOR * MAX_THREADS_PER_BLOCK) / warpSize
+
+Directly controlling the warps per execution unit makes it easier to reason
+about the occupancy, unlike with blocks, where the occupancy depends on the
+block size.
+
+The use of execution units rather than multiprocessors also provides support for
+architectures with multiple execution units per multiprocessor. For example, the
+AMD GCN architecture has 4 execution units per multiprocessor.
+
+maxregcount
+================================================================================
+
+Unlike ``nvcc``, ``amdclang++`` does not support the ``--maxregcount`` option.
+Instead, users are encouraged to use the ``__launch_bounds__`` directive, since
+its parameters are more intuitive and portable than micro-architecture details
+like registers. The directive allows per-kernel control.
