Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Technical debt changes #1957

Merged
merged 11 commits into from
Jan 10, 2025
Merged

Technical debt changes #1957

merged 11 commits into from
Jan 10, 2025

Conversation

oleksandr-pavlyk
Copy link
Collaborator

@oleksandr-pavlyk oleksandr-pavlyk commented Jan 8, 2025

This PR checks in some technical debt changes.

  • Do not user anonymous namespace in C++ header files.

Rationale: Entities defined in anonymous namespace have internal linkage, that is, they are generated in every translation units that includes the header file bloating the size of compiled binary, as well increasing compilation time

  • Move inline kernel submissions for tasks that do not depend on all template parameters of the function to a separate function.

Rationale: Inline use requires kernel names to be distinct for function different instantiations, producing multiple copies of identical kernels, bloating the binary.

This change alone shed 13MB off the size of _tensor_accumulation_impl native extension (from 49MB to 36MB).


  • Have you provided a meaningful PR description?
  • Have you added a test, reproducer or referred to an issue with a reproducer?
  • Have you tested your changes locally for CPU and GPU devices?
  • Have you made sure that new changes do not introduce compiler warnings?
  • Have you checked performance impact of proposed changes?
  • Have you added documentation for your changes, if necessary?
  • Have you added your changes to the changelog?
  • If this PR is a work in progress, are you opening the PR as a draft?

…eaders

Using anonymous namespace in header files is against best C++ practices,
since entities in anonymous namespace have internal linkage, and every
translation unit that includes the header file would have its own copy,
increasing compilation time and bloating the binary size.
Using anonymous namespace in headers is against best C++ practices
due to internal linkage of entities in that namespace.
Avoid using comparator type to form kernel name types for
iota and map_back kernels (as they do not depedent on
comparator). This reduces the number of kernels generated
during instantiation of template implementation functions.
Since vectors `ptrs` and `dels` are no longer needed after host_task
submission, we might as well avoid the copying and use std::move
in lambda capture initialization.

Also renamed `Args` template pack to `UniquePtrTs`, and `args`
template argument to `unique_ptrs`.

Added comments next to each include to note the entity which requires
it.
…late params

Removed unncessary template parameters from kernel names submitted by these
functions. As a consequence, the size of `_tensor_accumulation_impl` shared
object reduced from 49'360'152 bytes to 36'422'888, that is, by almost 13MB.
@oleksandr-pavlyk
Copy link
Collaborator Author

@AlexanderKalistratov I pushed the changes you suggested to make in async_smart_free in 80f288c

Copy link

github-actions bot commented Jan 8, 2025

Deleted rendered PR docs from intelpython.github.com/dpctl, latest should be updated shortly. 🤞

Copy link

github-actions bot commented Jan 8, 2025

Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_415 ran successfully.
Passed: 894
Failed: 2
Skipped: 118

@coveralls
Copy link
Collaborator

coveralls commented Jan 8, 2025

Coverage Status

coverage: 87.715% (-0.001%) from 87.716%
when pulling 065413e on technical-debt-changes
into a7ca491 on master.

@ndgrigorian
Copy link
Collaborator

ndgrigorian commented Jan 9, 2025

The changes in this PR have also reduced the build time of dpctl, by about 10 minutes for me

this branch

34m33.407s

master

45m25.987s

Direct calls to host_task to asynchronously deallocate
USM temporary are replaced with call to async_smart_free
which submits the host_task for us and transfers allocation
ownership from smart pointer to the host task.
…_generic_impl to take packed shape/strides as const pointer
The unique_ptr owns the allocation ensuring no leaks during
exception handling.

This also allows async_smart_free to be used to schedule
asynchronous deallocation of USM temporaries.
Copy link

github-actions bot commented Jan 9, 2025

Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_420 ran successfully.
Passed: 894
Failed: 2
Skipped: 118

Doing so reduces the binary size. Previously, '_tensor_sorting_impl'
module has size 22'448'920 bytes, and '_tensor_sorting_radix_impl'
has size 31'927'256 bytes.

Total size was 54'376'176 bytes.

After this change, the total size of the new '_tensor_sorting_impl'
is 49'790'872, which is about 4Mb of savings.
Copy link

Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_421 ran successfully.
Passed: 893
Failed: 3
Skipped: 118

Copy link
Collaborator

@ndgrigorian ndgrigorian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes LGTM overall

Thank you @oleksandr-pavlyk

@oleksandr-pavlyk oleksandr-pavlyk merged commit 9f8f90b into master Jan 10, 2025
60 of 61 checks passed
@oleksandr-pavlyk oleksandr-pavlyk deleted the technical-debt-changes branch January 10, 2025 12:38
antonwolfy added a commit to IntelPython/dpnp that referenced this pull request Jan 11, 2025
The PR proposes to implement fixes/improvements into the public CI:
- run `pre-commit` and `building docs` workflows on Ubuntu-22.04
(`ubuntu-latest` now refers to Ubuntu-24.04 where the workflows fail as
for now)
- use `fail-fast: false` rather than `continue-on-error: false` in
workflow with night tests run (some jobs in strategy matrix may fail due
to mamba/conda issues)
_Note, the workflow will be reported as failed if any job fails._
- trigger rerun tests step on failure of `run_tests` step only in night
tests workflow
- trigger resetup minoconda step on failure in night tests and conda
package workflows
- build name of running job based on python and OS from strategy matrix
- disable night tests workflow to be run in forked repositories

Also the PR includes adaptation to the recent interface changes in dpctl
implemented in
[gh-1957](IntelPython/dpctl#1957).
github-actions bot added a commit to IntelPython/dpnp that referenced this pull request Jan 11, 2025
The PR proposes to implement fixes/improvements into the public CI:
- run `pre-commit` and `building docs` workflows on Ubuntu-22.04
(`ubuntu-latest` now refers to Ubuntu-24.04 where the workflows fail as
for now)
- use `fail-fast: false` rather than `continue-on-error: false` in
workflow with night tests run (some jobs in strategy matrix may fail due
to mamba/conda issues)
_Note, the workflow will be reported as failed if any job fails._
- trigger rerun tests step on failure of `run_tests` step only in night
tests workflow
- trigger resetup minoconda step on failure in night tests and conda
package workflows
- build name of running job based on python and OS from strategy matrix
- disable night tests workflow to be run in forked repositories

Also the PR includes adaptation to the recent interface changes in dpctl
implemented in
[gh-1957](IntelPython/dpctl#1957). 303a203
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants