Technical debt changes #1957
…eaders
Using an anonymous namespace in header files is against C++ best practices: entities in an anonymous namespace have internal linkage, so every translation unit that includes the header gets its own copy, increasing compilation time and bloating the binary.
Using an anonymous namespace in headers is against C++ best practices due to the internal linkage of entities in that namespace.
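A minimal sketch of the replacement pattern, assuming C++17 (this is not code taken from dpctl): a helper that would otherwise sit in `namespace { ... }` in a header moves into a named namespace and is marked inline, which gives it external linkage so the linker keeps a single copy. The namespace `dpctl_sketch::detail` and the names `ceil_div` and `default_wg_size` are hypothetical.

```cpp
#include <cstddef>

// Hypothetical header content; the namespace and names are illustrative.
namespace dpctl_sketch
{
namespace detail
{

// Before: `namespace { std::size_t ceil_div(...) {...} }` in a header gave
// each including translation unit its own internal-linkage copy.
// After: a function in a named namespace that is inline (constexpr implies
// inline) has external linkage, so the linker keeps a single copy.
constexpr std::size_t ceil_div(std::size_t n, std::size_t d)
{
    return (n + d - 1) / d;
}

// C++17 inline variable: one object shared by every translation unit.
// (A plain namespace-scope constexpr variable would have internal linkage.)
inline constexpr std::size_t default_wg_size = 256;

} // namespace detail
} // namespace dpctl_sketch
```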
Avoid using the comparator type to form kernel name types for the iota and map_back kernels (as they do not depend on the comparator). This reduces the number of kernels generated during instantiation of template implementation functions.
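To illustrate the idea with hypothetical names (`iota_krn_old`, `iota_krn` and `sort_impl_kernel_names` are not dpctl identifiers), the sketch below compares kernel name types chosen inside a sort implementation templated on value type and comparator. When the name carries the comparator, ascending and descending sorts produce two distinct names, hence two compiled kernels; when it does not, they share one. A map_back kernel name would follow the same pattern.

```cpp
#include <functional>
#include <type_traits>

// Hypothetical kernel name tags; declarations are enough to serve as names.
template <typename T, typename Comp> class iota_krn_old; // carries Comp
template <typename T> class iota_krn;                    // does not

// Stand-in for the names a sort implementation templated on <T, Comp>
// would pick for its iota kernel.
template <typename T, typename Comp> struct sort_impl_kernel_names
{
    using iota_before = iota_krn_old<T, Comp>;
    using iota_after = iota_krn<T>;
};

using Asc = sort_impl_kernel_names<float, std::less<float>>;
using Desc = sort_impl_kernel_names<float, std::greater<float>>;

// Before: two comparators, two distinct kernel names.
static_assert(!std::is_same_v<Asc::iota_before, Desc::iota_before>);
// After: both comparators reuse the same kernel name.
static_assert(std::is_same_v<Asc::iota_after, Desc::iota_after>);
```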
Since the vectors `ptrs` and `dels` are no longer needed after host_task submission, we can avoid copying them by using std::move in the lambda's capture initialization. Also renamed the `Args` template parameter pack to `UniquePtrTs`, and the `args` function parameter pack to `unique_ptrs`. Added comments next to each include to note the entity which requires it.
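A minimal sketch of the capture change, under the assumption that `ptrs` holds raw pointers and `dels` holds type-erased deleters. `deferred_task` and `make_free_task` are hypothetical names, and a plain `std::function` stands in for the host_task body so that the example builds without SYCL.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for the body of a host_task: a callable that runs
// after the submitting function has returned.
using deferred_task = std::function<void()>;

// Hypothetical helper: builds a task that calls dels[i](ptrs[i]) for each i.
// The init-captures move the vectors into the closure, so their buffers are
// transferred instead of copied, leaving the caller's vectors empty.
deferred_task make_free_task(std::vector<void *> &&ptrs,
                             std::vector<std::function<void(void *)>> &&dels)
{
    assert(ptrs.size() == dels.size());
    return [ptrs = std::move(ptrs), dels = std::move(dels)]() {
        for (std::size_t i = 0; i < ptrs.size(); ++i) {
            dels[i](ptrs[i]);
        }
    };
}
```

With a by-copy capture `[ptrs, dels]` instead, the closure would duplicate both vectors, including every `std::function` deleter.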
…late params
Removed unnecessary template parameters from kernel names submitted by these functions. As a consequence, the size of the `_tensor_accumulation_impl` shared object was reduced from 49'360'152 bytes to 36'422'888 bytes, that is, by almost 13MB.
@AlexanderKalistratov I pushed the changes you suggested to make in
Deleted rendered PR docs from intelpython.github.com/dpctl, latest should be updated shortly. 🤞
Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_415 ran successfully.
The changes in this PR have also reduced the build time of dpctl by about 10 minutes for me (this branch vs. master).
Direct calls to host_task to asynchronously deallocate USM temporaries are replaced with calls to async_smart_free, which submits the host_task for us and transfers allocation ownership from the smart pointers to the host task.
…_generic_impl to take packed shape/strides as const pointer
The unique_ptr owns the allocation, ensuring no leaks during exception handling. This also allows async_smart_free to be used to schedule asynchronous deallocation of USM temporaries.
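The ownership transfer can be sketched without SYCL as follows. This is a hypothetical stand-in, not dpctl's `async_smart_free`: `std::async` replaces `host_task`, there is no queue or list of dependency events, and, instead of collecting raw pointers and releasing the smart pointers after submission, the sketch simply moves the `unique_ptr`s themselves into the task. The name `async_smart_free_sketch` is ours.

```cpp
#include <cassert>
#include <future>
#include <memory>
#include <tuple>
#include <utility>

// Hypothetical sketch, not dpctl's API: std::async stands in for
// sycl::handler::host_task. Every unique_ptr passed in is moved into the
// task, so the task becomes the sole owner of the allocations and the
// callers' smart pointers end up null (lvalue arguments are moved from too).
template <typename... UniquePtrTs>
std::future<void> async_smart_free_sketch(UniquePtrTs &&...unique_ptrs)
{
    return std::async(
        std::launch::async,
        [owned = std::make_tuple(std::move(unique_ptrs)...)]() mutable {
            // Reset explicitly so every deleter runs inside the task,
            // before the returned future becomes ready.
            std::apply([](auto &...p) { (p.reset(), ...); }, owned);
        });
}
```

Because each allocation is owned by some `unique_ptr` at every point, it is still freed if launching the task throws, which mirrors the no-leak motivation above.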
Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_420 ran successfully.
Doing so reduces the binary size. Previously, the '_tensor_sorting_impl' module had a size of 22'448'920 bytes, and '_tensor_sorting_radix_impl' had 31'927'256 bytes, for a total of 54'376'176 bytes. After this change, the size of the new '_tensor_sorting_impl' is 49'790'872 bytes, a saving of about 4.6 MB.
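The quoted byte counts can be checked with a few compile-time assertions. The constant names are ours, and the C++14 digit separators match the notation used in the commit message; the difference works out to 4'585'304 bytes.

```cpp
#include <cstdint>

// Module sizes quoted in the commit message, in bytes.
constexpr std::uint64_t sorting_impl_before = 22'448'920;
constexpr std::uint64_t sorting_radix_impl_before = 31'927'256;
constexpr std::uint64_t sorting_total_before =
    sorting_impl_before + sorting_radix_impl_before;

// Size of the new _tensor_sorting_impl after the change.
constexpr std::uint64_t sorting_impl_after = 49'790'872;
constexpr std::uint64_t sorting_saved = sorting_total_before - sorting_impl_after;

static_assert(sorting_total_before == 54'376'176);
static_assert(sorting_saved == 4'585'304); // roughly 4.6 MB
```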
Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_421 ran successfully.
Changes LGTM overall
Thank you @oleksandr-pavlyk
The PR proposes to implement the following fixes/improvements in the public CI:
- run the `pre-commit` and `building docs` workflows on Ubuntu-22.04 (`ubuntu-latest` now refers to Ubuntu-24.04, where the workflows currently fail)
- use `fail-fast: false` rather than `continue-on-error: false` in the nightly tests workflow (some jobs in the strategy matrix may fail due to mamba/conda issues). _Note, the workflow will be reported as failed if any job fails._
- trigger the rerun-tests step only on failure of the `run_tests` step in the nightly tests workflow
- trigger the re-setup Miniconda step on failure in the nightly tests and conda package workflows
- build the name of the running job from the Python version and OS in the strategy matrix
- disable the nightly tests workflow from running in forked repositories

Also, the PR includes adaptations to the recent interface changes in dpctl implemented in [gh-1957](IntelPython/dpctl#1957).
This PR checks in some technical debt changes.
Rationale: Entities defined in an anonymous namespace have internal linkage, that is, a separate copy is generated in every translation unit that includes the header file, bloating the size of the compiled binary as well as increasing compilation time.
Rationale: Inline use requires kernel names to be distinct for different instantiations of the function, producing multiple copies of identical kernels and bloating the binary.
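The effect described in this rationale can be modeled in plain C++ (a hypothetical model, not dpctl code): a function-local static counts how many distinct kernel name types get instantiated, as a runtime proxy for how many kernels would be compiled into the binary. Carrying a template parameter that the kernel body does not use multiplies the count; dropping it collapses the instantiations into one. `submit_kernel`, the kernel name tags, and `OpA`/`OpB`/`OpC` are illustrative names.

```cpp
#include <cassert>
#include <cstddef>

// Runtime proxy for compile-time kernel generation (hypothetical model).
inline std::size_t kernels_generated = 0;

template <typename KernelName> void submit_kernel()
{
    // A function-local static is initialized once per instantiation,
    // i.e. once per distinct KernelName type.
    static const bool registered = (++kernels_generated, true);
    (void)registered;
}

// Hypothetical kernel name tags.
template <typename T, typename Unused> class krn_with_unused_param;
template <typename T> class krn_trimmed;

// Hypothetical types the kernel body does not depend on.
struct OpA {};
struct OpB {};
struct OpC {};

// Stand-ins for an implementation function template instantiated per Op.
template <typename T, typename Op> void impl_before()
{
    submit_kernel<krn_with_unused_param<T, Op>>();
}
template <typename T, typename Op> void impl_after()
{
    submit_kernel<krn_trimmed<T>>();
}
```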
This change alone shed 13MB off the size of the `_tensor_accumulation_impl` native extension (from 49MB to 36MB).