Commit a414cb2

Docs: Expand HIP porting guide and CUDA driver porting guide
1 parent 742be96 commit a414cb2

3 files changed, +133 -64 lines changed

docs/how-to/hip_cpp_language_extensions.rst

Lines changed: 0 additions & 37 deletions

@@ -250,43 +250,6 @@ Units, also known as SIMDs, each with their own register file. For more
 information see :doc:`../understand/hardware_implementation`.
 :cpp:struct:`hipDeviceProp_t` also has a field ``executionUnitsPerMultiprocessor``.
 
-Porting from CUDA __launch_bounds__
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-CUDA also defines a ``__launch_bounds__`` qualifier which works similar to HIP's
-implementation, however it uses different parameters:
-
-.. code-block:: cpp
-
-  __launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MULTIPROCESSOR)
-
-The first parameter is the same as HIP's implementation, but
-``MIN_BLOCKS_PER_MULTIPROCESSOR`` must be converted to
-``MIN_WARPS_PER_EXECUTION``, which uses warps and execution units rather than
-blocks and multiprocessors. This conversion is performed automatically by
-:doc:`HIPIFY <hipify:index>`, or can be done manually with the following
-equation.
-
-.. code-block:: cpp
-
-  MIN_WARPS_PER_EXECUTION_UNIT = (MIN_BLOCKS_PER_MULTIPROCESSOR * MAX_THREADS_PER_BLOCK) / warpSize
-
-Directly controlling the warps per execution unit makes it easier to reason
-about the occupancy, unlike with blocks, where the occupancy depends on the
-block size.
-
-The use of execution units rather than multiprocessors also provides support for
-architectures with multiple execution units per multiprocessor. For example, the
-AMD GCN architecture has 4 execution units per multiprocessor.
-
-maxregcount
-""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
-
-Unlike ``nvcc``, ``amdclang++`` does not support the ``--maxregcount`` option.
-Instead, users are encouraged to use the ``__launch_bounds__`` directive since
-the parameters are more intuitive and portable than micro-architecture details
-like registers. The directive allows per-kernel control.
-
 Memory space qualifiers
 ================================================================================

docs/how-to/hip_porting_driver_api.rst

Lines changed: 77 additions & 16 deletions

@@ -1,33 +1,30 @@
 .. meta::
   :description: This chapter presents how to port the CUDA driver API and showcases equivalent operations in HIP.
-  :keywords: AMD, ROCm, HIP, CUDA, driver API
+  :keywords: AMD, ROCm, HIP, CUDA, driver API, porting, port
 
 .. _porting_driver_api:
 
 *******************************************************************************
 Porting CUDA driver API
 *******************************************************************************
 
-NVIDIA provides separate CUDA driver and runtime APIs. The two APIs have
-significant overlap in functionality:
-
-* Both APIs support events, streams, memory management, memory copy, and error
-  handling.
-
-* Both APIs deliver similar performance.
+CUDA provides separate driver and runtime APIs. The two APIs generally provide
+the same functionality; however, the driver API allows for more fine-grained
+control over initialization, context management, and module management, all of
+which the runtime API takes care of implicitly.
 
 * Driver API calls begin with the prefix ``cu``, while runtime API calls begin
   with the prefix ``cuda``. For example, the driver API contains
   ``cuEventCreate``, while the runtime API contains ``cudaEventCreate``, which
   has similar functionality.
 
-* The driver API defines a different, but largely overlapping, error code space
-  than the runtime API and uses a different coding convention. For example, the
-  driver API defines ``CUDA_ERROR_INVALID_VALUE``, while the runtime API defines
-  ``cudaErrorInvalidValue``.
+* The driver API offers two additional functionalities not directly provided by
+  the runtime API: the ``cuModule`` and ``cuCtx`` APIs.
 
-The driver API offers two additional functionalities not provided by the runtime
-API: ``cuModule`` and ``cuCtx`` APIs.
+HIP does not explicitly provide two different APIs; the functions corresponding
+to the CUDA driver API are available in the HIP runtime API and are usually
+prefixed with ``hipDrv``. The module and context functionality is available with
+the ``hipModule`` and ``hipCtx`` prefixes.
 
 cuModule API
 ================================================================================
@@ -123,8 +120,8 @@ HIPIFY translation of CUDA driver API
 The HIPIFY tools convert CUDA driver APIs for streams, events, modules, devices, memory management, context, and the profiler to the equivalent HIP calls. For example, ``cuEventCreate`` is translated to ``hipEventCreate``.
 HIPIFY tools also convert error codes from the driver namespace and coding conventions to the equivalent HIP error code. HIP unifies the APIs for these common functions.
 
-The memory copy API requires additional explanation. The CUDA driver includes the memory direction in the name of the API (``cuMemcpyH2D``), while the CUDA driver API provides a single memory copy API with a parameter that specifies the direction. It also supports a "default" direction where the runtime determines the direction automatically.
-HIP provides APIs with both styles, for example, ``hipMemcpyH2D`` as well as ``hipMemcpy``.
+The memory copy API requires additional explanation. The CUDA driver API includes the memory direction in the name of the API (``cuMemcpyHtoD``), while the CUDA runtime API provides a single memory copy API with a parameter that specifies the direction. It also supports a "default" direction where the runtime determines the direction automatically.
+HIP provides both versions, for example, ``hipMemcpyHtoD`` as well as ``hipMemcpy``.
 The first version might be faster in some cases because it avoids any host overhead to detect the different memory directions.
 
 HIP defines a single error space and uses camel case for all errors (i.e. ``hipErrorInvalidValue``).
@@ -547,3 +544,67 @@ The HIP version number is defined as an integer:
 .. code-block:: cpp
 
   HIP_VERSION=HIP_VERSION_MAJOR * 10000000 + HIP_VERSION_MINOR * 100000 + HIP_VERSION_PATCH
+
+********************************************************************************
+CU_POINTER_ATTRIBUTE_MEMORY_TYPE
+********************************************************************************
+
+To get a pointer's memory type in HIP, use :cpp:func:`hipPointerGetAttributes`.
+The first parameter of the function is a pointer to a
+:cpp:struct:`hipPointerAttribute_t` structure; its ``type`` member indicates
+whether the memory pointed to is allocated on the device or the host.
+
+For example:
+
+.. code-block:: cpp
+
+  double* ptr;
+  hipMalloc(&ptr, sizeof(double));
+  hipPointerAttribute_t attr;
+  hipPointerGetAttributes(&attr, ptr); // attr.type is hipMemoryTypeDevice
+  if(attr.type == hipMemoryTypeDevice)
+    std::cout << "ptr is of type hipMemoryTypeDevice" << std::endl;
+
+  double* ptrHost;
+  hipHostMalloc(&ptrHost, sizeof(double));
+  hipPointerAttribute_t attrHost;
+  hipPointerGetAttributes(&attrHost, ptrHost); // attrHost.type is hipMemoryTypeHost
+  if(attrHost.type == hipMemoryTypeHost)
+    std::cout << "ptrHost is of type hipMemoryTypeHost" << std::endl;
+
+Note that the ``hipMemoryType`` enum values are different from the
+``cudaMemoryType`` enum values.
+
+For example, on the AMD platform, ``hipMemoryType`` is defined in
+``hip_runtime_api.h``:
+
+.. code-block:: cpp
+
+  typedef enum hipMemoryType {
+    hipMemoryTypeHost = 0,    ///< Memory is physically located on host
+    hipMemoryTypeDevice = 1,  ///< Memory is physically located on device (see deviceId for specific device)
+    hipMemoryTypeArray = 2,   ///< Array memory, physically located on device (see deviceId for specific device)
+    hipMemoryTypeUnified = 3, ///< Not used currently
+    hipMemoryTypeManaged = 4  ///< Managed memory, automatically managed by the unified memory system
+  } hipMemoryType;
+
+The CUDA toolkit defines ``cudaMemoryType`` as follows:
+
+.. code-block:: cpp
+
+  enum cudaMemoryType
+  {
+    cudaMemoryTypeUnregistered = 0, // Unregistered memory.
+    cudaMemoryTypeHost = 1,         // Host memory.
+    cudaMemoryTypeDevice = 2,       // Device memory.
+    cudaMemoryTypeManaged = 3       // Managed memory.
+  };
+
+The memory type translation for ``hipPointerGetAttributes`` therefore needs to
+be handled properly on the NVIDIA platform to return the correct memory type in
+CUDA; this is done in the file ``nvidia_hip_runtime_api.h``.
+
+Consequently, in any HIP application that works with memory types, developers
+should use ``#ifdef`` guards to assign the correct enum values depending on
+whether the NVIDIA or AMD platform is targeted.
+
+As an example, see the code in `hipMemcpyParam2D.cc <https://github.com/ROCm/hip-tests/tree/develop/catch/unit/memory/hipMemcpyParam2D.cc>`_.
+With the ``#ifdef`` condition, HIP APIs work as expected on both AMD and NVIDIA
+platforms.
+
+Note that ``cudaMemoryTypeUnregistered`` is currently not represented in the
+``hipMemoryType`` enum, to preserve backward compatibility of the HIP API.

docs/how-to/hip_porting_guide.rst

Lines changed: 56 additions & 11 deletions

@@ -14,10 +14,21 @@ suggestions on how to port CUDA code and work through common issues.
 Porting a CUDA Project
 ********************************************************************************
 
+Mixing HIP and CUDA code results in valid CUDA code. This enables users to
+incrementally port CUDA to HIP and still compile and test the code during the
+transition.
+
+The only notable exception is ``hipError_t``, which is not just an alias of
+``cudaError_t``. In these cases HIP provides functions to convert between the
+error code spaces:
+
+* :cpp:func:`hipErrorToCudaError`
+* :cpp:func:`hipCUDAErrorTohipError`
+* :cpp:func:`hipCUResultTohipError`
+
 General Tips
 ================================================================================
 
-* You can incrementally port pieces of the code to HIP while leaving the rest in CUDA. HIP is just a thin layer over CUDA, so the two languages can interoperate.
 * Starting to port on an NVIDIA machine is often the easiest approach, as the code can be tested for functionality and performance even if not fully ported to HIP.
 * Once the CUDA code is ported to HIP and is running on the CUDA machine, compile the HIP code for an AMD machine.
 * You can handle platform-specific features through conditional compilation or by adding them to the open-source HIP infrastructure.
@@ -533,16 +544,6 @@ supports, together with the corresponding macros and device properties.
   - ``hasDynamicParallelism``
   - Ability to launch a kernel from within a kernel
 
-********************************************************************************
-Finding HIP
-********************************************************************************
-
-Makefiles can use the following syntax to conditionally provide a default HIP_PATH if one does not exist:
-
-.. code-block:: shell
-
-  HIP_PATH ?= $(shell hipconfig --path)
-
 ********************************************************************************
 Compilation
 ********************************************************************************
@@ -555,6 +556,12 @@ options are appropriate for the target compiler.
 ``hipconfig`` is a helpful tool in identifying the current systems platform,
 compiler and runtime. It can also help set options appropriately.
 
+For example, it can provide the path to the HIP installation for use in
+Makefiles:
+
+.. code-block:: shell
+
+  HIP_PATH ?= $(shell hipconfig --path)
+
 HIP Headers
 ================================================================================
 
@@ -602,3 +609,41 @@ platforms and architectures. The ``warpSize`` built-in should be used in device
 code, while the host can query it during runtime via the device properties. See
 the :ref:`HIP language extension for warpSize <warp_size>` for information on
 how to write portable wave-aware code.
+
+********************************************************************************
+Porting from CUDA __launch_bounds__
+********************************************************************************
+
+CUDA also defines a ``__launch_bounds__`` qualifier, which works similarly to
+HIP's implementation but uses different parameters:
+
+.. code-block:: cpp
+
+  __launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MULTIPROCESSOR)
+
+The first parameter is the same as in HIP's implementation, but
+``MIN_BLOCKS_PER_MULTIPROCESSOR`` must be converted to
+``MIN_WARPS_PER_EXECUTION_UNIT``, which uses warps and execution units rather
+than blocks and multiprocessors. This conversion is performed automatically by
+:doc:`HIPIFY <hipify:index>`, or can be done manually with the following
+equation:
+
+.. code-block:: cpp
+
+  MIN_WARPS_PER_EXECUTION_UNIT = (MIN_BLOCKS_PER_MULTIPROCESSOR * MAX_THREADS_PER_BLOCK) / warpSize
+
+Directly controlling the warps per execution unit makes it easier to reason
+about the occupancy, unlike with blocks, where the occupancy depends on the
+block size.
+
+The use of execution units rather than multiprocessors also provides support for
+architectures with multiple execution units per multiprocessor. For example, the
+AMD GCN architecture has 4 execution units per multiprocessor.
+
+maxregcount
+================================================================================
+
+Unlike ``nvcc``, ``amdclang++`` does not support the ``--maxregcount`` option.
+Instead, users are encouraged to use the ``__launch_bounds__`` directive, since
+its parameters are more intuitive and portable than micro-architecture details
+like registers. The directive allows per-kernel control.
