Technical debt changes #1957

oleksandr-pavlyk · 2025-01-08T22:34:04Z

This PR checks in some technical debt changes.

Do not use anonymous namespaces in C++ header files.

Rationale: Entities defined in an anonymous namespace have internal linkage, that is, they are generated in every translation unit that includes the header file, bloating the size of the compiled binary as well as increasing compilation time.
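The linkage difference can be sketched in a single file as below. This is a minimal illustration, not code from dpctl: the names `padded_size_internal`, `detail::padded_size`, and `aligned_bytes` are hypothetical, and the use of a named `detail` namespace with `inline` functions is an assumption about one common replacement for the anonymous namespace.

```cpp
// Sketch of what might sit in a header file. All names are hypothetical.
#include <cassert>
#include <cstddef>

// Discouraged in a header: internal linkage means that every translation
// unit including this file gets its own private copy of the function.
namespace
{
std::size_t padded_size_internal(std::size_t n, std::size_t align)
{
    return ((n + align - 1) / align) * align;
}
} // namespace

// One common alternative: a named namespace plus `inline`. The function has
// external linkage, so identical definitions coming from different
// translation units can be merged into a single copy at link time.
namespace detail
{
inline std::size_t padded_size(std::size_t n, std::size_t align)
{
    return ((n + align - 1) / align) * align;
}
} // namespace detail

// What a source file including the header might do with the helper.
std::size_t aligned_bytes(std::size_t n_bytes)
{
    const std::size_t padded = detail::padded_size(n_bytes, 64);
    assert(padded % 64 == 0); // result is always a multiple of the alignment
    return padded;
}
```

Both helpers compute the same value; only their linkage differs, which matters once the header is included from many translation units.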

Move inline kernel submissions for tasks that do not depend on all template parameters of the function to a separate function.

Rationale: Inline submission requires kernel names to be distinct for different instantiations of the function, producing multiple copies of identical kernels and bloating the binary.

This change alone shed 13 MB off the size of the `_tensor_accumulation_impl` native extension (from 49 MB to 36 MB).
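The effect can be modelled on the host without SYCL. The sketch below is an analogy under stated assumptions, not dpctl code: `submit` stands in for a kernel submission that needs a unique kernel-name type, `emitted_kernels` stands in for the set of device kernels the compiler would generate, and every name (`iota_krn_before`, `iota_krn_after`, `fill_iota`, `sort_before`, `sort_after`) is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <set>
#include <typeindex>
#include <typeinfo>
#include <utility>
#include <vector>

// Host-only model: one entry per distinct kernel-name type.
inline std::set<std::type_index> &emitted_kernels()
{
    static std::set<std::type_index> kernels;
    return kernels;
}

// Stand-in for a kernel submission that is keyed by a kernel-name type.
template <typename KernelName, typename Fn>
void submit(std::size_t n, Fn fn)
{
    emitted_kernels().insert(std::type_index(typeid(KernelName)));
    for (std::size_t i = 0; i < n; ++i) {
        fn(i);
    }
}

// Before: the iota step is submitted inline, so its kernel name has to
// include Comp even though the work does not depend on the comparator.
template <typename IndexT, typename Comp> struct iota_krn_before {};

template <typename IndexT, typename Comp>
void sort_before(std::vector<IndexT> &idx, const Comp &)
{
    submit<iota_krn_before<IndexT, Comp>>(
        idx.size(), [&idx](std::size_t i) { idx[i] = static_cast<IndexT>(i); });
    // ... comparator-dependent work would follow here ...
}

// After: the iota step lives in a helper templated only on what it uses.
template <typename IndexT> struct iota_krn_after {};

template <typename IndexT> void fill_iota(std::vector<IndexT> &idx)
{
    submit<iota_krn_after<IndexT>>(
        idx.size(), [&idx](std::size_t i) { idx[i] = static_cast<IndexT>(i); });
}

template <typename IndexT, typename Comp>
void sort_after(std::vector<IndexT> &idx, const Comp &)
{
    fill_iota(idx);
    // ... comparator-dependent work would follow here ...
}

// Returns {distinct kernels in the "before" layout, in the "after" layout}
// when each sort is instantiated for two different comparators.
std::pair<std::size_t, std::size_t> count_kernels()
{
    std::vector<int> idx(4);

    emitted_kernels().clear();
    sort_before(idx, std::less<int>{});
    sort_before(idx, std::greater<int>{});
    const std::size_t before = emitted_kernels().size();

    emitted_kernels().clear();
    sort_after(idx, std::less<int>{});
    sort_after(idx, std::greater<int>{});
    const std::size_t after = emitted_kernels().size();

    return {before, after};
}
```

In this model the number of iota kernels stops growing with the number of comparators once the submission is factored out, which is the mechanism behind the size reduction described above.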

Have you provided a meaningful PR description?
Have you added a test, reproducer or referred to an issue with a reproducer?
Have you tested your changes locally for CPU and GPU devices?
Have you made sure that new changes do not introduce compiler warnings?
Have you checked performance impact of proposed changes?
Have you added documentation for your changes, if necessary?
Have you added your changes to the changelog?
If this PR is a work in progress, are you opening the PR as a draft?

…eaders

Using anonymous namespaces in header files is against C++ best practices: entities in an anonymous namespace have internal linkage, so every translation unit that includes the header file gets its own copy, increasing compilation time and bloating the binary size.

Using anonymous namespace in headers is against best C++ practices due to internal linkage of entities in that namespace.

Avoid using the comparator type to form kernel name types for the iota and map_back kernels (as they do not depend on the comparator). This reduces the number of kernels generated during instantiation of template implementation functions.

Since vectors `ptrs` and `dels` are no longer needed after host_task submission, we might as well avoid the copying and use std::move in lambda capture initialization. Also renamed `Args` template pack to `UniquePtrTs`, and `args` template argument to `unique_ptrs`. Added comments next to each include to note the entity which requires it.
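A minimal sketch of the move-capture idea follows, assuming a plain callable in place of the host task. The helpers `acquire`, `release`, `make_cleanup_task`, and `demo_move_capture` are hypothetical and exist only for this illustration; the vector of raw `int *` pointers stands in for `ptrs`.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Number of allocations that have not been released yet.
static int live_allocations = 0;

int *acquire()
{
    ++live_allocations;
    return new int(0);
}

void release(int *p)
{
    delete p;
    --live_allocations;
}

// Builds a deferred cleanup task. The init-capture with std::move lets the
// lambda take over the vector's storage instead of copying its elements.
auto make_cleanup_task(std::vector<int *> ptrs)
{
    return [ptrs = std::move(ptrs)]() {
        for (int *p : ptrs) {
            release(p);
        }
    };
}

// Returns true if nothing is freed before the task runs and everything is
// freed afterwards.
bool demo_move_capture()
{
    std::vector<int *> ptrs{acquire(), acquire(), acquire()};

    // The caller no longer needs `ptrs`, so it is moved into the task.
    auto task = make_cleanup_task(std::move(ptrs));
    const bool still_alive = (live_allocations == 3);
    assert(still_alive);

    task(); // run the deferred cleanup exactly once
    return still_alive && (live_allocations == 0);
}
```

The task must run exactly once, since running a copy of it a second time would release the same pointers twice.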

…late params

Removed unnecessary template parameters from the kernel names submitted by these functions. As a consequence, the size of the `_tensor_accumulation_impl` shared object dropped from 49'360'152 bytes to 36'422'888 bytes, that is, by almost 13 MB.

oleksandr-pavlyk · 2025-01-08T22:35:27Z

@AlexanderKalistratov I pushed the changes you suggested to make in async_smart_free in 80f288c

github-actions · 2025-01-08T23:12:55Z

Deleted rendered PR docs from intelpython.github.com/dpctl, latest should be updated shortly. 🤞

github-actions · 2025-01-08T23:19:32Z

Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_415 ran successfully.
Passed: 894
Failed: 2
Skipped: 118

coveralls · 2025-01-08T23:20:39Z

coverage: 87.715% (-0.001%) from 87.716%
when pulling 065413e on technical-debt-changes
into a7ca491 on master.

ndgrigorian · 2025-01-09T03:28:21Z

The changes in this PR have also reduced the build time of dpctl, by about 10 minutes for me:

- this branch: 34m33.407s
- master: 45m25.987s

dpctl/tensor/libtensor/include/kernels/accumulators.hpp

Direct calls to host_task to asynchronously deallocate USM temporaries are replaced with calls to async_smart_free, which submits the host_task for us and transfers ownership of the allocation from the smart pointer to the host task.

Change signature of copy_and_cast_from_host_impl and copy_for_reshape_generic_impl to take packed shape/strides as const pointer

The unique_ptr owns the allocation ensuring no leaks during exception handling. This also allows async_smart_free to be used to schedule asynchronous deallocation of USM temporaries.
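The ownership pattern can be sketched in standard C++ as below. This is a host-only analogy and not the dpctl implementation: `std::malloc`/`std::free` stand in for USM allocation and deallocation, the returned callable stands in for the host task, and `buffer_ptr`, `allocate_temp`, `defer_free`, and `demo_ownership` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <new>
#include <stdexcept>
#include <utility>
#include <vector>

// Number of outstanding allocations, so that leaks are observable.
static int live = 0;

struct CountingFree
{
    void operator()(void *p) const
    {
        std::free(p);
        --live;
    }
};

using buffer_ptr = std::unique_ptr<char, CountingFree>;

// Allocation helper that returns an owning smart pointer.
buffer_ptr allocate_temp(std::size_t nbytes)
{
    void *raw = std::malloc(nbytes);
    if (raw == nullptr) {
        throw std::bad_alloc();
    }
    ++live;
    return buffer_ptr(static_cast<char *>(raw));
}

// Hands ownership to a deferred task: release() gives up ownership without
// freeing, and the returned callable becomes responsible for the memory.
template <typename... UniquePtrTs> auto defer_free(UniquePtrTs &&...unique_ptrs)
{
    std::vector<char *> raw{unique_ptrs.release()...};
    return [raw = std::move(raw)]() {
        for (char *p : raw) {
            CountingFree{}(p);
        }
    };
}

bool demo_ownership()
{
    // Exception path: the smart pointer frees the temporary during unwinding.
    try {
        buffer_ptr tmp = allocate_temp(64);
        throw std::runtime_error("simulated failure");
    }
    catch (const std::runtime_error &) {
    }
    const bool no_leak_on_throw = (live == 0);

    // Normal path: ownership moves from the smart pointers to the task.
    buffer_ptr a = allocate_temp(64);
    buffer_ptr b = allocate_temp(128);
    auto task = defer_free(a, b);
    const bool handed_over = (a == nullptr) && (b == nullptr) && (live == 2);
    assert(handed_over);

    task(); // deferred deallocation runs once
    return no_leak_on_throw && handed_over && (live == 0);
}
```

Until `defer_free` is called, any exception frees the temporaries automatically; after it, the smart pointers are empty and only the deferred task frees them.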

github-actions · 2025-01-09T21:43:06Z

Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_420 ran successfully.
Passed: 894
Failed: 2
Skipped: 118

Doing so reduces the binary size. Previously, the '_tensor_sorting_impl' module was 22'448'920 bytes and '_tensor_sorting_radix_impl' was 31'927'256 bytes, for a total of 54'376'176 bytes. After this change, the new '_tensor_sorting_impl' is 49'790'872 bytes, a saving of 4'585'304 bytes (more than 4 MB).

github-actions · 2025-01-10T03:16:40Z

Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_421 ran successfully.
Passed: 893
Failed: 3
Skipped: 118

ndgrigorian

Changes LGTM overall

Thank you @oleksandr-pavlyk

The PR proposes fixes and improvements to the public CI:

- run the `pre-commit` and `building docs` workflows on Ubuntu-22.04 (`ubuntu-latest` now refers to Ubuntu-24.04, where the workflows currently fail)
- use `fail-fast: false` rather than `continue-on-error: false` in the workflow that runs night tests (some jobs in the strategy matrix may fail due to mamba/conda issues). _Note: the workflow will be reported as failed if any job fails._
- trigger the rerun tests step on failure of the `run_tests` step only in the night tests workflow
- trigger the miniconda re-setup step on failure in the night tests and conda package workflows
- build the name of the running job from the Python version and OS in the strategy matrix
- prevent the night tests workflow from running in forked repositories

The PR also includes an adaptation to the recent interface changes in dpctl implemented in [gh-1957](IntelPython/dpctl#1957).

303a203

oleksandr-pavlyk added 5 commits January 8, 2025 09:02

Replace use of anonymous namespace in headers

87052e2

Reduced number of created iota and map_back kernels

869faef

oleksandr-pavlyk requested a review from ndgrigorian as a code owner January 8, 2025 22:34

oleksandr-pavlyk requested a review from AlexanderKalistratov January 8, 2025 22:35

AlexanderKalistratov approved these changes Jan 9, 2025

oleksandr-pavlyk added 5 commits January 9, 2025 14:46

Add comment before call to unique_ptr::release method

8c167bf

Add comments explaining intention of unique_ptr::reset call

ce02c6c

Replace sycl::malloc_device with smart_malloc_device

fea54b6

Change signature of copy_and_cast_from_host_impl and copy_for_reshape_generic_impl to take packed shape/strides as const pointer

9841f9e

Change to device_allocate_and_pack to return unique_ptr

9d77faf

ndgrigorian approved these changes Jan 10, 2025

oleksandr-pavlyk merged commit 9f8f90b into master Jan 10, 2025
60 of 61 checks passed

oleksandr-pavlyk deleted the technical-debt-changes branch January 10, 2025 12:38

antonwolfy mentioned this pull request Jan 10, 2025

Resolve issues in public CI IntelPython/dpnp#2254

Merged
