From bd882dcd506ad3e1999fedc5712f1b01291836ca Mon Sep 17 00:00:00 2001
From: Dmitriy Sobolev
Date: Fri, 7 Feb 2025 16:31:53 +0000
Subject: [PATCH] Allow specifying SYCL barrier type

---
 documentation/library_guide/macros.rst        | 182 +++++++++---------
 .../oneapi/dpl/pstl/hetero/dpcpp/sycl_defs.h  |  11 +-
 2 files changed, 105 insertions(+), 88 deletions(-)

diff --git a/documentation/library_guide/macros.rst b/documentation/library_guide/macros.rst
index a18ba670ee5..944b71c2929 100644
--- a/documentation/library_guide/macros.rst
+++ b/documentation/library_guide/macros.rst
@@ -50,89 +50,99 @@ Additional Macros
 Use these macros to control aspects of |onedpl_short| usage. You can set them in your program code
 before including |onedpl_short| headers.
 
-================================== ==============================
-Macro                              Description
-================================== ==============================
-``PSTL_USE_NONTEMPORAL_STORES``    This macro enables the use of ``#pragma vector nontemporal``
-                                   for write-only data when algorithms such as ``std::copy``, ``std::fill``, etc.,
-                                   are executed with unsequenced policies.
-                                   For further details about the pragma, see the |vector_pragma|_.
-                                   If the macro evaluates to a non-zero value,
-                                   the use of ``#pragma vector nontemporal`` is enabled.
-                                   By default, the macro is not defined.
-
-                                   Using this macro may have the same effect on the implementation of parallel
-                                   algorithms in the C++ standard libraries of GCC and LLVM.
------------------------------------ ------------------------------
-``PSTL_USAGE_WARNINGS``            This macro enables Parallel STL to
-                                   emit compile-time messages, such as warnings
-                                   about an algorithm not supporting a certain execution policy.
-                                   When set to 1, the macro allows the implementation to emit
-                                   usage warnings. When the macro is not defined (by default)
-                                   or evaluates to zero, usage warnings are disabled.
-
-                                   Using this macro may have the same effect on the implementation of parallel
-                                   algorithms in the C++ standard libraries of GCC and LLVM.
------------------------------------ ------------------------------
-``ONEDPL_USE_TBB_BACKEND``         This macro controls the use of |onetbb_long| or |tbb_long| for parallel
-                                   execution policies (``par`` and ``par_unseq``).
-
-                                   When the macro evaluates to a non-zero value, or when it is not defined (by default)
-                                   and no other parallel backends are explicitly chosen, algorithms with parallel policies
-                                   are executed using the |onetbb_short| or |tbb_short| library.
-                                   Setting the macro to 0 disables use of TBB API for parallel execution and is recommended
-                                   for code that should not depend on the presence of the |onetbb_short| or |tbb_short| library.
-
-                                   If all parallel backends are disabled by setting respective macros to 0, algorithms
-                                   with parallel policies are executed sequentially by the calling thread.
------------------------------------ ------------------------------
-``ONEDPL_USE_OPENMP_BACKEND``      This macro controls the use of OpenMP* for parallel execution policies (``par`` and ``par_unseq``).
-
-                                   When the macro evaluates to a non-zero value, algorithms with parallel policies are executed
-                                   using OpenMP unless the TBB backend is explicitly enabled (that is, the TBB backend takes
-                                   precedence over the OpenMP backend).
-                                   When the macro is not defined (by default) and no other parallel backends are chosen,
-                                   a dedicated compiler option to enable OpenMP (such as ``-fopenmp``) also enables its use
-                                   for algorithms with parallel policies.
-                                   Setting the macro to 0 disables use of OpenMP for parallel execution.
-
-                                   If all parallel backends are disabled by setting respective macros to 0, algorithms
-                                   with parallel policies are executed sequentially by the calling thread.
------------------------------------ ------------------------------
-``ONEDPL_USE_DPCPP_BACKEND``       This macro enables the use of device execution policies.
-
-                                   When the macro is not defined (default),
-                                   device policies are enabled only if SYCL support can be detected;
-                                   otherwise, they are disabled.
-                                   If the macro is set to a non-zero value, device policies are enabled unconditionally.
-                                   Setting the macro to 0 disables device policies.
-
-                                   When device policies are disabled, no SYCL dependency is introduced,
-                                   and their usage will lead to compilation errors.
------------------------------------ ------------------------------
-``ONEDPL_USE_PREDEFINED_POLICIES`` This macro enables the use of predefined device policy objects,
-                                   such as ``dpcpp_default`` and ``dpcpp_fpga``. When the macro is not defined (by default)
-                                   or evaluates to non-zero, predefined policies objects can be used.
-                                   When the macro is set to 0, predefined policies objects and make functions
-                                   without arguments (``make_device_policy()`` and ``make_fpga_policy()``) are not available.
------------------------------------ ------------------------------
-``ONEDPL_ALLOW_DEFERRED_WAITING``  This macro allows waiting for completion of certain algorithms executed with
-                                   device policies to be deferred. (Disabled by default.)
-
-                                   When the macro evaluates to non-zero, a call to a oneDPL algorithm with
-                                   a device policy might return before the computation completes on the device.
-
-                                   .. Warning:: Before accessing data produced or modified by the call, waiting
-                                                for completion of all tasks in the corresponding SYCL queue is required;
-                                                otherwise, the program behavior is undefined.
------------------------------------ ------------------------------
-``ONEDPL_FPGA_DEVICE``             Use this macro to build your code containing |onedpl_short| parallel
-                                   algorithms for FPGA devices. (Disabled by default.)
------------------------------------ ------------------------------
-``ONEDPL_FPGA_EMULATOR``           Use this macro to build your code containing Parallel STL
-                                   algorithms for FPGA emulation device. (Disabled by default.)
-
-                                   .. Note:: Define ``ONEDPL_FPGA_DEVICE`` and ``ONEDPL_FPGA_EMULATOR`` macros in the same
-                                             application to run on a FPGA emulation device.
-                                             Define only the ``ONEDPL_FPGA_DEVICE`` macro to run on a FPGA hardware device.
-================================== ==============================
+==================================== ==============================
+Macro                                Description
+==================================== ==============================
+``PSTL_USE_NONTEMPORAL_STORES``      This macro enables the use of ``#pragma vector nontemporal``
+                                     for write-only data when algorithms such as ``std::copy``, ``std::fill``, etc.,
+                                     are executed with unsequenced policies.
+                                     For further details about the pragma, see the |vector_pragma|_.
+                                     If the macro evaluates to a non-zero value,
+                                     the use of ``#pragma vector nontemporal`` is enabled.
+                                     By default, the macro is not defined.
+
+                                     Using this macro may have the same effect on the implementation of parallel
+                                     algorithms in the C++ standard libraries of GCC and LLVM.
+------------------------------------ ------------------------------
+``PSTL_USAGE_WARNINGS``              This macro enables Parallel STL to
+                                     emit compile-time messages, such as warnings
+                                     about an algorithm not supporting a certain execution policy.
+                                     When set to 1, the macro allows the implementation to emit
+                                     usage warnings. When the macro is not defined (by default)
+                                     or evaluates to zero, usage warnings are disabled.
+
+                                     Using this macro may have the same effect on the implementation of parallel
+                                     algorithms in the C++ standard libraries of GCC and LLVM.
+------------------------------------ ------------------------------
+``ONEDPL_USE_TBB_BACKEND``           This macro controls the use of |onetbb_long| or |tbb_long| for parallel
+                                     execution policies (``par`` and ``par_unseq``).
+
+                                     When the macro evaluates to a non-zero value, or when it is not defined (by default)
+                                     and no other parallel backends are explicitly chosen, algorithms with parallel policies
+                                     are executed using the |onetbb_short| or |tbb_short| library.
+                                     Setting the macro to 0 disables use of TBB API for parallel execution and is recommended
+                                     for code that should not depend on the presence of the |onetbb_short| or |tbb_short| library.
+
+                                     If all parallel backends are disabled by setting respective macros to 0, algorithms
+                                     with parallel policies are executed sequentially by the calling thread.
+------------------------------------ ------------------------------
+``ONEDPL_USE_OPENMP_BACKEND``        This macro controls the use of OpenMP* for parallel execution policies (``par`` and ``par_unseq``).
+
+                                     When the macro evaluates to a non-zero value, algorithms with parallel policies are executed
+                                     using OpenMP unless the TBB backend is explicitly enabled (that is, the TBB backend takes
+                                     precedence over the OpenMP backend).
+                                     When the macro is not defined (by default) and no other parallel backends are chosen,
+                                     a dedicated compiler option to enable OpenMP (such as ``-fopenmp``) also enables its use
+                                     for algorithms with parallel policies.
+                                     Setting the macro to 0 disables use of OpenMP for parallel execution.
+
+                                     If all parallel backends are disabled by setting respective macros to 0, algorithms
+                                     with parallel policies are executed sequentially by the calling thread.
+------------------------------------ ------------------------------
+``ONEDPL_USE_DPCPP_BACKEND``         This macro enables the use of device execution policies.
+
+                                     When the macro is not defined (default),
+                                     device policies are enabled only if SYCL support can be detected;
+                                     otherwise, they are disabled.
+                                     If the macro is set to a non-zero value, device policies are enabled unconditionally.
+                                     Setting the macro to 0 disables device policies.
+
+                                     When device policies are disabled, no SYCL dependency is introduced,
+                                     and their usage will lead to compilation errors.
+------------------------------------ ------------------------------
+``ONEDPL_USE_PREDEFINED_POLICIES``   This macro enables the use of predefined device policy objects,
+                                     such as ``dpcpp_default`` and ``dpcpp_fpga``. When the macro is not defined (by default)
+                                     or evaluates to non-zero, predefined policies objects can be used.
+                                     When the macro is set to 0, predefined policies objects and make functions
+                                     without arguments (``make_device_policy()`` and ``make_fpga_policy()``) are not available.
+------------------------------------ ------------------------------
+``ONEDPL_ALLOW_DEFERRED_WAITING``    This macro allows waiting for completion of certain algorithms executed with
+                                     device policies to be deferred. (Disabled by default.)
+
+                                     When the macro evaluates to non-zero, a call to a oneDPL algorithm with
+                                     a device policy might return before the computation completes on the device.
+
+                                     .. Warning:: Before accessing data produced or modified by the call, waiting
+                                                  for completion of all tasks in the corresponding SYCL queue is required;
+                                                  otherwise, the program behavior is undefined.
+------------------------------------ ------------------------------
+``ONEDPL_FPGA_DEVICE``               Use this macro to build your code containing |onedpl_short| parallel
+                                     algorithms for FPGA devices. (Disabled by default.)
+------------------------------------ ------------------------------
+``ONEDPL_FPGA_EMULATOR``             Use this macro to build your code containing Parallel STL
+                                     algorithms for FPGA emulation device. (Disabled by default.)
+
+                                     .. Note:: Define ``ONEDPL_FPGA_DEVICE`` and ``ONEDPL_FPGA_EMULATOR`` macros in the same
+                                               application to run on a FPGA emulation device.
+                                               Define only the ``ONEDPL_FPGA_DEVICE`` macro to run on a FPGA hardware device.
+------------------------------------ ------------------------------
+``ONEDPL_USE_SYCL121_GROUP_BARRIER`` The macro controls which API for group barriers oneDPL uses,
+                                     which can be either as defined in SYCL 1.2.1 or as in SYCL 2020.
+                                     It affects algorithms that use device execution policies.
+
+                                     Set this macro to a non-zero value to enable SYCL 1.2.1 group barriers.
+                                     The default value is 1 when using the oneAPI DPC++ Compiler and 0 otherwise.
+
+                                     .. Note:: Depending on a GPU driver, SYCL 1.2.1 group barriers can provide better performance
+                                               on Intel GPUs. The default value may change in future releases in favor of SYCL 2020 group barriers.
+==================================== ==============================
diff --git a/include/oneapi/dpl/pstl/hetero/dpcpp/sycl_defs.h b/include/oneapi/dpl/pstl/hetero/dpcpp/sycl_defs.h
index c35306e4dca..398d4020f3e 100644
--- a/include/oneapi/dpl/pstl/hetero/dpcpp/sycl_defs.h
+++ b/include/oneapi/dpl/pstl/hetero/dpcpp/sycl_defs.h
@@ -227,11 +227,18 @@ __get_accessor_size(const _Accessor& __accessor)
 }
 
 // TODO: switch to SYCL 2020 with DPC++ compiler.
-// SYCL 1.2.1 version is used due to better performance on Intel GPUs.
+// SYCL 1.2.1 version is used due to having an API with a local memory fence,
+// which gives better performance on Intel GPUs.
 // The performance gap is negligible since
 // https://github.com/intel/intel-graphics-compiler/commit/ed639f68d142bc963a7b626badc207a42fb281cb (Aug 20, 2024)
 // But the fix is not a part of the LTS GPU drivers (Linux) yet.
-#define ONEDPL_USE_SYCL121_GROUP_BARRIER 1
+#if !defined(ONEDPL_USE_SYCL121_GROUP_BARRIER)
+#    if _ONEDPL_LIBSYCL_VERSION
+#        define ONEDPL_USE_SYCL121_GROUP_BARRIER 1
+#    else
+#        define ONEDPL_USE_SYCL121_GROUP_BARRIER 0
+#    endif
+#endif
 
 #if ONEDPL_USE_SYCL121_GROUP_BARRIER
 template