Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYCL][Joint Matrix] Relax matrix_combinations error handling #12866

Conversation

dm-vodopyanov
Copy link
Contributor

This patch relaxes reporting of unsupported HW for matrix_combinations
query.
Before: exception thrown in case customer uses matrix_combinations
query on platform unsupported by ext_oneapi_device_architecture.
With this patch: customer gets empty vector from matrix_combinations
query on platform unsupported by ext_oneapi_device_architecture.

This patch relaxes reporting of unsupported HW for matrix_combinations
query.
Before: exception thrown in case customer uses matrix_combinations
query on platform unsupported by ext_oneapi_device_architecture.
With this patch: customer gets empty vector from matrix_combinations
query on platform unsupported by ext_oneapi_device_architecture.
@dm-vodopyanov dm-vodopyanov requested a review from a team as a code owner February 29, 2024 12:24
@dm-vodopyanov
Copy link
Contributor Author

@intel/sycl-matrix-reviewers: FYI

@dm-vodopyanov dm-vodopyanov requested a review from a team February 29, 2024 12:25
@dm-vodopyanov dm-vodopyanov changed the title [SYCL][Joint Matrx] Relax matrix_combinations error handling [SYCL][Joint Matrix] Relax matrix_combinations error handling Feb 29, 2024
@dkhaldi
Copy link
Contributor

dkhaldi commented Feb 29, 2024

@dm-vodopyanov is the idea to make matrix feature independent from ext_oneapi_device_architecture?
so if users use a new hardware not included yet in ext_oneapi_device_architecture, code does not crash at ext_oneapi_device_architecture level but rather they see no support of matrix for this hardware.

@AlexeySachkov
Copy link
Contributor

@dm-vodopyanov is the idea to make matrix feature independent from ext_oneapi_device_architecture? so if users use a new hardware not included yet in ext_oneapi_device_architecture, code does not crash at ext_oneapi_device_architecture level but rather they see no support of matrix for this hardware.

Please note that the extension spec currently doesn't describe the behavior if HW is unsupported by the implementation which can be considered a gap in there. My reading of the extension spec is that since it returns a vector, then it should never emit an exception if HW is unknown or unrecognizable, because an empty vector serves as an indication of "matrix functionality is not supported here".

If we think that exception for unknown HW is useful, then we should probably document it explicitly in the extension spec.

@dm-vodopyanov
Copy link
Contributor Author

dm-vodopyanov commented Feb 29, 2024

@dm-vodopyanov is the idea to make matrix feature independent from ext_oneapi_device_architecture? so if users use a new hardware not included yet in ext_oneapi_device_architecture, code does not crash at ext_oneapi_device_architecture level but rather they see no support of matrix for this hardware.

Yeah, that's correct. According to matrix spec, this query can return empty vector in case architecture is not supported, so it will be for architectures which are not supported by device_architecture extension

@dm-vodopyanov
Copy link
Contributor Author

@dm-vodopyanov is the idea to make matrix feature independent from ext_oneapi_device_architecture? so if users use a new hardware not included yet in ext_oneapi_device_architecture, code does not crash at ext_oneapi_device_architecture level but rather they see no support of matrix for this hardware.

Please note that the extension spec currently doesn't describe the behavior if HW is unsupported by the implementation which can be considered a gap in there. My reading of the extension spec is that since it returns a vector, then it should never emit an exception if HW is unknown or unrecognizable, because an empty vector serves as an indication of "matrix functionality is not supported here".

If we think that exception for unknown HW is useful, then we should probably document it explicitly in the extension spec.

Probably there is also an option to make the exception message friendlier if we want to keep throwing the exception

@dkhaldi
Copy link
Contributor

dkhaldi commented Feb 29, 2024

@dm-vodopyanov is the idea to make matrix feature independent from ext_oneapi_device_architecture? so if users use a new hardware not included yet in ext_oneapi_device_architecture, code does not crash at ext_oneapi_device_architecture level but rather they see no support of matrix for this hardware.

Please note that the extension spec currently doesn't describe the behavior if HW is unsupported by the implementation which can be considered a gap in there. My reading of the extension spec is that since it returns a vector, then it should never emit an exception if HW is unknown or unrecognizable, because an empty vector serves as an indication of "matrix functionality is not supported here".
If we think that exception for unknown HW is useful, then we should probably document it explicitly in the extension spec.

Probably there is also an option to make the exception message friendlier if we want to keep throwing the exception

As an optional kernel feature, the extension spec does mention explicitly throwing an exception when combination is not supported:

See in https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc#backend-support-status
“if the application submits a kernel using an unsupported joint_matrix type or calls joint_matrix_mad with an unsupported combination, the implementation throws a synchronous exception with the errc::kernel_not_supported error code as described in section 5.7.”

I like the fact that we make matrix independent from ext_oneapi_device_architecture, but matrix implementation relies today on ext_oneapi_device_architecture to generate such exception, right?
Is there a way to still generate an exception in matrix for cases where architectures are not part of ext_oneapi_device_architecture?

@AlexeySachkov
Copy link
Contributor

As an optional kernel feature, the extension spec does mention explicitly throwing an exception when combination is not supported:

The exception we see comes from get_info query and not from queue.submit.

I like the fact that we make matrix independent from ext_oneapi_device_architecture, but matrix implementation relies today on ext_oneapi_device_architecture to generate such exception, right?

The matrix combinations runtime query is dependent on ext_onapi_device_architecture implementation and we are not removing that. We are just fixing an edge case where the query is invoked on an unknown/unrecognizable HW.

@dkhaldi
Copy link
Contributor

dkhaldi commented Feb 29, 2024

As an optional kernel feature, the extension spec does mention explicitly throwing an exception when combination is not supported:

The exception we see comes from get_info query and not from queue.submit.

I like the fact that we make matrix independent from ext_oneapi_device_architecture, but matrix implementation relies today on ext_oneapi_device_architecture to generate such exception, right?

The matrix combinations runtime query is dependent on ext_onapi_device_architecture implementation and we are not removing that. We are just fixing an edge case where the query is invoked on an unknown/unrecognizable HW.

What happens when the query is invoked on a known HW to ext_onapi_device_architecture but the HW was not added to the matrix query yet? do we also generate an empty vector in this case?

@AlexeySachkov
Copy link
Contributor

What happens when the query is invoked on a known HW to ext_onapi_device_architecture but the HW was not added to the matrix query yet? do we also generate an empty vector in this case?

Yes. we generate an empty vector in that case

Copy link
Contributor

@dkhaldi dkhaldi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

sycl/source/detail/device_info.hpp Outdated Show resolved Hide resolved
Co-authored-by: Sergey Semenov <sergey.semenov@intel.com>
@dm-vodopyanov
Copy link
Contributor Author

Failures are not related:

Failed Tests (1):
  SYCL :: Graph/RecordReplay/host_task_multiple_deps.cpp

Submitted issue: #12941

Failed Tests (1):
  SYCL :: ESIMD/dword_local_accessor_atomic_smoke.cpp

Submitted issue: #12942

@dm-vodopyanov dm-vodopyanov merged commit c00305b into intel:sycl Mar 7, 2024
10 of 12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants