
Optimization of PushRowPage for high number of cpu cores #11182

Open · wants to merge 21 commits into master
Conversation

@razdoburdin (Contributor) commented Jan 23, 2025

This PR adds optimal row-wise parallelization for PushRowPage.
The current implementation is inefficient on many-core CPUs when a dataset has a large number of rows and a small number of columns. This PR removes that bottleneck: I observed up to 15% performance improvement for large datasets such as airline on a system with a 2x56-core CPU.

Update:
Some illustrations of changes in parallel processing:

We have a sparse input matrix. Each cell corresponds to a feature, and we process each cell individually, placing the results into the output array.
[diagram: the sparse input matrix and the output array]
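As a point of reference, the per-cell work described above can be sketched sequentially (hypothetical names and a stand-in transform, not the actual XGBoost `PushRowPage` code): every stored entry of a CSR matrix is processed independently and the result is scattered into a per-column output.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal CSR sparse-matrix layout, for illustration only.
struct Csr {
  std::vector<std::size_t> row_ptr;  // size n_rows + 1
  std::vector<std::size_t> col_idx;  // column (feature) of each entry
  std::vector<double> values;        // value of each entry
};

// Stand-in for the per-cell transform: count how many stored entries
// fall into each column. The real code computes and stores per-entry
// results, but the scatter-by-column access pattern is the same.
std::vector<std::size_t> CountPerColumn(const Csr& m, std::size_t n_cols) {
  std::vector<std::size_t> counts(n_cols, 0);
  for (std::size_t i = 0; i < m.col_idx.size(); ++i) {
    ++counts[m.col_idx[i]];
  }
  return counts;
}
```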

The original implementation processes data column by column. This approach is suboptimal when there are many CPU cores but only a few columns.
[diagram: column-by-column parallelization]
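The limitation can be stated in one line (a hypothetical helper, only for illustration): with one task per column, the achievable parallelism is capped by the column count, no matter how many cores are available.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// With column-wise parallelism each thread owns whole columns, so the
// number of threads doing useful work never exceeds the number of
// columns -- e.g. a 10-column dataset keeps at most 10 of the 112
// hardware threads of a 2x56-core machine busy.
std::size_t UsableThreads(std::size_t n_columns, std::size_t n_threads) {
  return std::min(n_columns, n_threads);
}
```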

In the initial optimization, each thread processed a subset of rows across all columns. This required significant memory overhead, because each thread needed buffers for intermediate results.
[diagram: per-thread row blocks with intermediate buffers]

In the current version, I have reorganized the logic. Now, the input matrix is analyzed by the LoadBalance function. If a column is too large, it is split across multiple threads. This reduces buffer sizes since each thread now processes only a single feature.
[diagram: LoadBalance splitting large columns across threads]
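A minimal sketch of the load-balancing idea, assuming a hypothetical `LoadBalance` signature (the actual function in the PR may differ): columns whose entry count exceeds the per-thread quota are cut into slices, so the work is spread evenly even when there are far fewer columns than threads.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// One unit of work: a contiguous range of entries within one column.
struct Slice {
  std::size_t column;  // which feature this slice belongs to
  std::size_t begin;   // first entry of the column handled by this slice
  std::size_t end;     // one past the last entry
};

// Cut columns into slices of at most ceil(total / n_threads) entries.
// A column larger than the quota is split into several slices; small
// columns stay whole, so each slice touches exactly one feature.
std::vector<Slice> LoadBalance(const std::vector<std::size_t>& col_sizes,
                               std::size_t n_threads) {
  std::size_t total =
      std::accumulate(col_sizes.begin(), col_sizes.end(), std::size_t{0});
  std::size_t quota = (total + n_threads - 1) / n_threads;  // ceiling
  std::vector<Slice> slices;
  for (std::size_t c = 0; c < col_sizes.size(); ++c) {
    for (std::size_t begin = 0; begin < col_sizes[c]; begin += quota) {
      std::size_t end = std::min(begin + quota, col_sizes[c]);
      slices.push_back({c, begin, end});
    }
  }
  return slices;
}
```

For example, one 100-entry column plus two 4-entry columns on 4 threads gives a quota of 27 entries, so the large column is split into 4 slices and the small ones stay whole.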

@trivialfis (Member) left a comment


Thank you for the optimization! Does the existing test cover both cases? If not, could you please consider extracting the code for small unit tests? Also, would you like to share your benchmark results?

@razdoburdin (Contributor, Author)
> Thank you for the optimization! Does the existing test cover both cases? If not, could you please consider extracting the code for small unit tests? Also, would you like to share your benchmark results?

Here is the speed-up in comparison with the current master branch.
[image: speed-up table vs. master]

As for unit tests, such a case is not easy to cover. I will try to figure it out.

@trivialfis (Member) commented Jan 24, 2025

One particular caution about similar optimizations: thread-local buffers can cost a significant amount of memory when the core count is high. On large server CPUs with hundreds of cores this can be severe. We partially reverted a similar optimization in the CPU predictor, which used thread-local blocks, for this reason.
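To make the concern concrete, here is a back-of-the-envelope estimate (the sizes are hypothetical, not measured from XGBoost): if every thread keeps one intermediate buffer per column, the footprint grows multiplicatively with the core count.

```cpp
#include <cassert>
#include <cstddef>

// Rough footprint of a "thread-local buffer per column" scheme: memory
// scales as threads x columns, which is what becomes severe on
// high-core-count servers.
std::size_t ThreadLocalBufferBytes(std::size_t n_threads,
                                   std::size_t n_columns,
                                   std::size_t bytes_per_buffer) {
  return n_threads * n_columns * bytes_per_buffer;
}
```

With 112 threads, 100 columns, and a 64 KiB buffer each, that is already 700 MiB of scratch memory on top of the data itself.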

@trivialfis (Member)
Ref #6659

@razdoburdin marked this pull request as draft on February 4, 2025, 13:46
@razdoburdin (Contributor, Author) commented Feb 18, 2025

Sorry for the long delay. I have reorganized the optimization to reduce memory overhead (see the PR description for details). I have also added tests to verify the changes.

@razdoburdin marked this pull request as ready for review on February 18, 2025, 08:31
@trivialfis (Member)
Thank you for continuing the optimization work. Could you please share the latest profiling result including performance and memory usage on the datasets that you are targeting?

@razdoburdin (Contributor, Author)
> Thank you for continuing the optimization work. Could you please share the latest profiling result including performance and memory usage on the datasets that you are targeting?

Here is the benchmark for the latest version of the PR. Numbers greater than 1 in the table mean that this PR is faster than the master branch, and that it consumes more memory, respectively.
[image: benchmark table, speed-up and memory ratios vs. master]
