Optimization of PushRowPage for high number of cpu cores #11182
base: master
Conversation
Thank you for the optimization! Does the existing test cover both cases? If not, could you please consider extracting the code into small unit tests? Also, would you like to share your benchmark results?
One particular caution about similar optimizations: thread-local buffers can cost a significant amount of memory when the core count is high. For large server CPUs with hundreds of cores this can be severe. We partially reverted a similar optimization in the CPU predictor, which used thread-local blocks, for this reason.
Ref #6659
Sorry for the long delay. I have reorganized the optimization to reduce memory overhead (see the PR description for details). I also added some tests to verify the changes.
Thank you for continuing the optimization work. Could you please share the latest profiling results, including performance and memory usage, on the datasets that you are targeting?
This PR adds optimal row-wise parallelization for PushRowPage. The current implementation is inefficient on multithreaded CPUs when a dataset has a large number of rows and a small number of columns; this PR fixes that bottleneck. I observed up to 15% performance improvement for large datasets like airline on a system with two 56-core CPUs.
Update:
Some illustrations of changes in parallel processing:
We have a sparse input matrix. Each cell corresponds to a feature, and we process each cell individually, placing the results into the output array.

The original implementation processes data column by column. This approach is suboptimal when there are many CPU cores but only a few columns.

In the initial optimization, each thread processed a subset of rows but included all columns. This approach required significant memory overhead, because each thread needed its own buffer for intermediate results spanning every column.

In the current version, I have reorganized the logic. Now the input matrix is analyzed by the LoadBalance function. If a column is too large, it is split across multiple threads. This reduces buffer sizes, since each thread now processes only a single feature.