
[SYCL][ext] Define and Implement sycl_ext_tensor_map #16247

Open · wants to merge 1 commit into sycl from luke/ext_cuda_tensor_map

Conversation

@ldrumm (Contributor) commented Dec 3, 2024

This is a fairly mechanical implementation of the basic infrastructure
required to access CUDA TMA descriptors from within SYCL kernels, while
initializing them on the host. The new feature exposes two new classes
and associated support structure in
sycl::ext::codeplay::experimental::cuda.

There's some ugliness involved to make this work on account of the way
NVIDIA implemented this basic feature, but it's all in the name of
{legitimate-field-of-endeavour}.
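
For context: on the NVIDIA side, a TMA descriptor is an opaque CUtensorMap encoded on the host with the driver API's cuTensorMapEncodeTiled and then handed to the kernel. Below is a rough sketch of that host-side call; the tensor extent, tile size, and option values are illustrative assumptions, and this is the raw driver API, not the extension's own interface.

// Sketch only: encoding a 2D TMA descriptor with the CUDA driver API.
// The extent, tile size, and option values here are assumptions.
#include <cuda.h>

CUtensorMap encode_2d_tensor_map(void *global_ptr /* 16-byte aligned */) {
  CUtensorMap tmap;  // opaque 128-byte descriptor, later passed to the kernel
  const cuuint64_t global_dim[2]     = {1024, 1024};            // extent in elements
  const cuuint64_t global_strides[1] = {1024 * sizeof(float)};  // row pitch in bytes
  const cuuint32_t box_dim[2]        = {64, 64};                // tile per TMA copy
  const cuuint32_t elem_strides[2]   = {1, 1};
  CUresult res = cuTensorMapEncodeTiled(
      &tmap, CU_TENSOR_MAP_DATA_TYPE_FLOAT32, /*tensorRank=*/2, global_ptr,
      global_dim, global_strides, box_dim, elem_strides,
      CU_TENSOR_MAP_INTERLEAVE_NONE, CU_TENSOR_MAP_SWIZZLE_NONE,
      CU_TENSOR_MAP_L2_PROMOTION_NONE, CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
  (void)res;  // error handling omitted in this sketch
  return tmap;
}

The extension essentially wraps this kind of host-side initialization so that ports such as CUTLASS don't have to include CUDA headers directly.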

@ldrumm requested review from a team as code owners on December 3, 2024 17:04
@ldrumm requested a review from againull on December 3, 2024 17:04
@ldrumm (Contributor, Author) commented Dec 3, 2024

This depends on Hugh's work for Unified Runtime here

@AlexeySachkov (Contributor) left a comment

DeviceConfigFile changes LGTM

@ldrumm force-pushed the luke/ext_cuda_tensor_map branch from 3d2c85f to 16f22f6 on December 3, 2024 17:24
@ldrumm (Contributor, Author) commented Dec 3, 2024

https://github.com/intel/llvm/actions/runs/12144906122/job/33865361253?pr=16247

I saw this build error locally due to a stale build tree. Are we not doing clean checkouts for CI?

@sarnex (Contributor) commented Dec 3, 2024

No, we used cached checkouts.

@ldrumm (Contributor, Author) commented Dec 3, 2024

> No, we used cached checkouts.

How do I clear them?

@sarnex (Contributor) commented Dec 3, 2024

I have to log into the runners and do it manually, but I don't know if other PRs will end up in the cache and cause the same problem. I'll try, give me a sec.

@ldrumm (Contributor, Author) commented Dec 3, 2024

> I have to log into the runners and do it manually, but I don't know if other PRs will end up in the cache and cause the same problem. I'll try, give me a sec.

Thanks. I'll see if I can track down why ur_api_funcs.def isn't considered out of date when the UR repo FetchContent changes.

@ldrumm (Contributor, Author) commented Dec 3, 2024

add_custom_command(
  OUTPUT  ${OUT_HEADERS_IN_SYCL_DIR}
          ${OUT_HEADERS_IN_CL_DIR}
          ${OUT_HEADERS_IN_STD_DIR}
          ${OUT_HEADERS_IN_SYCLCOMPAT_DIR}
  DEPENDS ${HEADERS_IN_SYCL_DIR}
          ${HEADERS_IN_CL_DIR}
          ${HEADERS_IN_STD_DIR}
          ${HEADERS_IN_SYCLCOMPAT_DIR}
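          # Note: the Unified Runtime headers copied below (ur_api.h,
          # ur_api_funcs.def, ur_print.hpp) are not listed in DEPENDS,
          # so this rule is not re-run when they change.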
  COMMAND ${CMAKE_COMMAND} -E copy_directory ${sycl_inc_dir}/sycl ${SYCL_INCLUDE_BUILD_DIR}/sycl
  COMMAND ${CMAKE_COMMAND} -E copy_directory ${sycl_inc_dir}/CL ${SYCL_INCLUDE_BUILD_DIR}/CL
  COMMAND ${CMAKE_COMMAND} -E copy_directory ${sycl_inc_dir}/std ${SYCL_INCLUDE_BUILD_DIR}/std
  COMMAND ${CMAKE_COMMAND} -E copy_directory ${sycl_inc_dir}/syclcompat ${SYCL_INCLUDE_BUILD_DIR}/syclcompat
  COMMAND ${CMAKE_COMMAND} -E copy ${sycl_inc_dir}/syclcompat.hpp ${SYCL_INCLUDE_BUILD_DIR}/syclcompat.hpp
  COMMAND ${CMAKE_COMMAND} -E copy ${UNIFIED_RUNTIME_INCLUDE_DIR}/ur_api.h ${SYCL_INCLUDE_BUILD_DIR}
  COMMAND ${CMAKE_COMMAND} -E copy ${UNIFIED_RUNTIME_INCLUDE_DIR}/ur_api_funcs.def ${SYCL_INCLUDE_BUILD_DIR}
  COMMAND ${CMAKE_COMMAND} -E copy ${UNIFIED_RUNTIME_INCLUDE_DIR}/ur_print.hpp ${SYCL_INCLUDE_BUILD_DIR}
  COMMENT "Copying SYCL headers ...")

Yeah there's no dependency on the input files for the UR headers. I'll submit a patch

@sarnex (Contributor) commented Dec 3, 2024

Cool. I tried clearing the cache but it didn't work because we check out intel/llvm HEAD first, so we hit the problem again. Ping me on the PR for the CMake fix and I'll try to fast track it

@ldrumm (Contributor, Author) commented Dec 4, 2024

> Yeah there's no dependency on the input files for the UR headers. I'll submit a patch

#16261

@ldrumm force-pushed the luke/ext_cuda_tensor_map branch from 43d7ed1 to c8dee17 on December 5, 2024 16:41
@againull (Contributor) left a comment

Could you please add tests for this feature?

> should not rely on APIs defined in this specification.* It is likely to be
> generalized and significantly change in later revisions as more backend vendors
> implement analogous features of more or less expressivity and generality than
> shown here.
A reviewer (Contributor) commented:
I'm not wild about adding this new category of "oneapi-only" extension. I'm guessing this is for XeTLA? If that is the only library we expect to use this, why not add the support directly in that library, rather than adding a SYCL API for it? I imagine this would end up calling CUDA APIs or inline asm statements, but I think XeTLA does this already for other devices. I'd feel differently if we were adding a general API that other applications could make use of, but that's not what we're doing in this PR.

@ldrumm (Contributor, Author) replied:

For CUTLASS, actually. But yeah, it's not great to be so limiting.

We discussed a couple of ways to do this (one of my suggestions being interop), but it seems there's little appetite for including CUDA-specific headers in these ports.

To be clear, there's no reason we really need to be so limiting with our wording here; I only added it because I'm not completely confident the feature has uses outside the CUTLASS case and wanted to limit the maintenance burden.

Would relaxing the language here be appropriate?

@ldrumm (Contributor, Author) replied:

I've used the standard boilerplate language for the Status section

@ldrumm force-pushed the luke/ext_cuda_tensor_map branch from c8dee17 to 1fba3c9 on December 10, 2024 19:07
@ldrumm (Contributor, Author) commented Dec 10, 2024

> Could you please add tests for this feature?

@againull Good catch. I've added aspect and macro tests. Using this feature requires an sm_90+ GPU and inline assembly, so I've skipped testing that part. Hope that's enough.
