Optimization of PushRowPage for high number of cpu cores #11182
base: master
Conversation
Thank you for the optimization! Does the existing test cover both cases? If not, could you please consider extracting the code into small unit tests? Also, would you like to share your benchmark results?
One particular caution about similar optimizations: thread-local buffers can cost a significant amount of memory when the core count is high. For large server CPUs with hundreds of cores this can be severe. We partially reverted a similar optimization in the CPU predictor, which used thread-local blocks, for this reason.
Ref #6659
Sorry for the long delay. I have reorganized the optimization to reduce memory overhead (see the PR description for details). I also added some tests to verify the changes.
Thank you for continuing the optimization work. Could you please share the latest profiling results, including performance and memory usage, on the datasets that you are targeting?
This PR adds optimal row-wise parallelization for PushRowPage. The current implementation is inefficient on multithreaded CPUs when a dataset has a large number of rows and a small number of columns; this PR fixes that bottleneck. I observed up to 15% performance improvement for large datasets like airline on a system with two 56-core CPUs.
Update:
Some illustrations of changes in parallel processing:
We have a sparse input matrix. Each cell corresponds to a feature, and we process each cell individually, placing the results into the output array.

The original implementation processes data column by column. This approach is suboptimal when there are many CPU cores but only a few columns.

In the initial optimization, each thread processed a subset of rows but included all columns. This approach required significant memory overhead, because each thread needed its own buffer for intermediate results spanning every column.

In the current version, I have reorganized the logic. Now the input matrix is analyzed by the LoadBalance function. If a column is too large, it is split across multiple threads. This reduces buffer sizes, since each thread now processes only a single feature.