LST CPU Speedups by GNiendorf · Pull Request #245 · SegmentLinking/cmssw

GNiendorf · 2026-03-18T16:43:40Z

This PR Timing (CPU) - commit 2 (pre-checks, exact trig simplifications, and additional early exits)

This PR Timing (CPU) - commit 1 (reducing redundant memory loads)

Master Timing (CPU)

GNiendorf · 2026-03-18T17:31:20Z

run-ci: all

github-actions · 2026-03-18T17:58:10Z

The PR was built and ran successfully in standalone mode running on CPU. Here are some of the comparison plots.

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     29.0    323.1    245.5    138.1     48.6    695.7     10.9    116.6    119.7    208.9      0.1    1936.1    1211.5+/- 290.1     602.5   explicit[s=4] (target branch)
   avg     28.1    218.8    178.6    127.2     49.7    700.7     10.6    109.5     83.0    202.6      0.1    1708.9     980.1+/- 239.1     545.5   explicit[s=4] (this PR)

github-actions · 2026-03-18T19:22:27Z

The PR was built and ran successfully with CMSSW running on CPU. Here are some plots.

OOTB All Tracks

The full set of validation and comparison plots can be found here.

GNiendorf · 2026-03-19T10:57:53Z

run-ci: all
modifiers: gpu

github-actions · 2026-03-19T11:19:17Z

The PR was built and ran successfully in standalone mode running on GPU. Here are some of the comparison plots.

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     31.1      0.2      0.4      0.6      0.9      0.3      0.6      0.5      0.3      1.4      0.0      36.2       4.8+/-  2.7      36.2   explicit[s=1]
   avg      1.1      0.3      0.5      0.8      1.0      0.3      0.8      0.7      0.4      1.8      0.0       7.7       6.3+/-  2.8       4.0   explicit[s=2]
   avg      2.0      0.6      0.8      1.2      1.5      0.4      1.2      1.0      0.6      2.8      0.0      12.1       9.7+/-  3.5       3.2   explicit[s=4]
   avg      3.2      0.9      1.2      1.7      2.0      0.5      1.7      1.3      0.8      3.9      0.0      17.2      13.5+/-  4.3       3.0   explicit[s=6]
   avg      3.7      1.3      1.7      2.4      2.6      0.7      2.3      1.6      1.0      4.9      0.0      22.3      17.9+/-  4.6       2.9   explicit[s=8] (target branch)
   avg     31.1      0.2      0.4      0.6      0.9      0.3      0.6      0.5      0.3      1.4      0.0      36.2       4.8+/-  2.6      36.3   explicit[s=1]
   avg      1.3      0.3      0.5      0.7      1.0      0.3      0.8      0.7      0.4      1.8      0.0       7.9       6.4+/-  2.8       4.1   explicit[s=2]
   avg      2.2      0.6      0.8      1.2      1.5      0.4      1.2      1.0      0.6      2.8      0.0      12.2       9.6+/-  3.3       3.2   explicit[s=4]
   avg      3.0      0.9      1.2      1.7      2.1      0.5      1.7      1.3      0.8      3.8      0.0      17.0      13.5+/-  4.1       3.0   explicit[s=6]
   avg      3.6      1.3      1.7      2.3      2.6      0.7      2.2      1.7      1.0      5.0      0.0      22.2      18.0+/-  4.5       2.9   explicit[s=8] (this PR)

GNiendorf · 2026-03-19T11:22:05Z

@slava77 I think this PR is good to go. It represents most of the boilerplate changes from the CPU speedups PR.

github-actions · 2026-03-19T12:35:39Z

The PR was built and ran successfully with CMSSW running on GPU. Here are some plots.

OOTB All Tracks

The full set of validation and comparison plots can be found here.

slava77 · 2026-03-19T13:29:14Z

the GPU variant should have one more significant digit in the component columns (the total can still be with .1).
I don't have a particular preference for this PR or a separate one.

slava77

nice updates.
I think the comment cleanup in the MiniDoublet code is a bit too aggressive. While some removals are fine for tautological comments, quite a bit of clarity is going to be lost. Please recover the comments that explain the logic in:

RecoTracker/LSTCore/src/alpaka/MiniDoublet.h

RecoTracker/LSTCore/src/alpaka/PixelTriplet.h

RecoTracker/LSTCore/src/alpaka/Segment.h

RecoTracker/LSTCore/src/alpaka/Triplet.h

GNiendorf · 2026-03-19T20:52:44Z

run-ci: all

GNiendorf · 2026-03-19T21:02:54Z

run-ci: all

GNiendorf · 2026-03-19T21:16:17Z

run-ci: all
modifiers: gpu

github-actions · 2026-03-19T21:23:16Z

The PR was built and ran successfully in standalone mode running on CPU. Here are some of the comparison plots.

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     28.2    324.2    243.0    136.6     47.8    698.4     10.9    114.7    118.8    208.7      0.1    1931.4    1204.8+/- 289.9     596.7   explicit[s=4] (target branch)
   avg     31.1    219.2    182.6    133.5     47.6    698.9     10.8    110.7     83.1    185.9      0.1    1703.5     973.4+/- 227.2     541.9   explicit[s=4] (this PR)

github-actions · 2026-03-19T22:18:47Z

The PR was built and ran successfully with CMSSW running on GPU. Here are some plots.

OOTB All Tracks

The full set of validation and comparison plots can be found here.

github-actions · 2026-03-19T22:23:37Z

The PR was built and ran successfully with CMSSW running on CPU. Here are some plots.

OOTB All Tracks

The full set of validation and comparison plots can be found here.

github-actions · 2026-03-19T22:39:10Z

The PR was built and ran successfully in standalone mode running on GPU. Here are some of the comparison plots.

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     32.5      0.2      0.4      0.6      0.9      0.3      0.6      0.5      0.3      1.4      0.0      37.6       4.8+/-  2.6      37.6   explicit[s=1]
   avg      1.1      0.4      0.5      0.8      1.0      0.3      0.8      0.7      0.4      1.8      0.0       7.8       6.4+/-  2.9       4.0   explicit[s=2]
   avg      1.8      0.6      0.8      1.1      1.5      0.4      1.2      1.0      0.6      2.8      0.0      11.9       9.7+/-  3.5       3.1   explicit[s=4]
   avg      2.6      0.9      1.3      1.7      2.0      0.5      1.7      1.2      0.8      3.9      0.0      16.6      13.5+/-  4.1       2.9   explicit[s=6]
   avg      3.4      1.3      1.7      2.3      2.6      0.7      2.3      1.6      1.0      5.0      0.0      21.9      17.8+/-  4.5       2.8   explicit[s=8] (target branch)
   avg     32.6      0.2      0.4      0.6      0.8      0.3      0.6      0.5      0.3      1.4      0.0      37.7       4.8+/-  2.5      37.7   explicit[s=1]
   avg      1.2      0.4      0.5      0.8      1.0      0.3      0.8      0.8      0.4      1.9      0.0       7.9       6.5+/-  2.8       4.1   explicit[s=2]
   avg      1.8      0.6      0.8      1.1      1.5      0.4      1.2      1.0      0.6      2.8      0.0      11.8       9.6+/-  3.5       3.1   explicit[s=4]
   avg      2.6      1.0      1.2      1.7      2.0      0.5      1.7      1.3      0.8      4.0      0.0      16.7      13.6+/-  4.0       2.9   explicit[s=6]
   avg      3.4      1.3      1.7      2.2      2.6      0.7      2.2      1.7      1.0      5.0      0.0      21.9      17.8+/-  4.5       2.8   explicit[s=8] (this PR)

GNiendorf · 2026-03-19T23:04:47Z

run-ci: all

GNiendorf · 2026-03-20T09:48:52Z

run-ci: all

github-actions · 2026-03-20T10:09:13Z

The PR was built and ran successfully in standalone mode running on CPU. Here are some of the comparison plots.

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     28.1    323.5    241.8    135.2     47.5    698.0     10.9    114.9    117.5    208.1      0.1    1925.6    1199.5+/- 290.5     596.4   explicit[s=4] (target branch)
   avg     31.1    103.1    122.0    122.9     49.6    684.4     11.0     40.2     69.3    212.6      0.1    1446.4     730.8+/- 189.1     479.6   explicit[s=4] (this PR)

github-actions · 2026-03-20T11:30:52Z

The PR was built and ran successfully with CMSSW running on CPU. Here are some plots.

OOTB All Tracks

The full set of validation and comparison plots can be found here.

slava77 · 2026-03-20T12:38:29Z

it would be good to get some rough accounting of the speedups by the category/type of changes (from a rough look at the code: early loop exit, moving variables early in nested loops, moving compute closest to use after the cuts, dPhi xy1,2, module data and hit data pre-loading to structs)

while some changes are rather clearly expected speedups, some others are less so.
Ideally, before squashing everything into a single commit, it would've been better to trace the sequence of improvements in finer chunks.
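To make two of the categories listed above concrete (moving loads out of the inner loop, and a cheap pre-check that exits early), here is a minimal sketch. It is not code from this PR: `HitsSoA`, `countPairsBaseline`, `countPairsHoisted`, and the distance cut are hypothetical stand-ins chosen only to show the pattern. Both functions return the same count; the second just does less work per rejected pair.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical SoA hit container; not the LST data format.
struct HitsSoA {
  std::vector<float> x;
  std::vector<float> y;
};

// Baseline: the lower-hit coordinates are re-read on every inner iteration,
// and the full squared distance is computed before the cut is applied.
inline std::size_t countPairsBaseline(const HitsSoA& lower, const HitsSoA& upper, float maxDist2) {
  std::size_t nPass = 0;
  for (std::size_t i = 0; i < lower.x.size(); ++i) {
    for (std::size_t j = 0; j < upper.x.size(); ++j) {
      const float dx = upper.x[j] - lower.x[i];
      const float dy = upper.y[j] - lower.y[i];
      if (dx * dx + dy * dy < maxDist2)
        ++nPass;
    }
  }
  return nPass;
}

// Same selection with (1) outer-loop values hoisted into locals and
// (2) a pre-check on dx alone. Because dy * dy is never negative, a pair with
// dx * dx >= maxDist2 cannot pass the full cut, so skipping it is exact.
inline std::size_t countPairsHoisted(const HitsSoA& lower, const HitsSoA& upper, float maxDist2) {
  std::size_t nPass = 0;
  const std::size_t nUpper = upper.x.size();
  for (std::size_t i = 0; i < lower.x.size(); ++i) {
    const float x0 = lower.x[i];  // loaded once per outer iteration
    const float y0 = lower.y[i];
    for (std::size_t j = 0; j < nUpper; ++j) {
      const float dx = upper.x[j] - x0;
      const float dx2 = dx * dx;
      if (dx2 >= maxDist2)
        continue;  // early exit: upper.y[j] is never loaded for this pair
      const float dy = upper.y[j] - y0;
      if (dx2 + dy * dy < maxDist2)
        ++nPass;
    }
  }
  return nPass;
}
```

In a toy case like this an optimizing compiler may already hoist the loads on its own, so timing each category separately on the real kernels is the only reliable accounting.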

slava77 · 2026-03-20T14:52:31Z

RecoTracker/LSTCore/src/alpaka/MiniDoublet.h

+
+  // Pre-computed module-constant data for MiniDoublet kernels.
+  // Populated once per module to avoid redundant SoA loads in the inner hit-pair loop.
+  struct ModuleMDData {


how large is the overlap in values between the MD-related and other module data?
It looks like this is a case for using AoS (a plain old array): 1-2 cache line fetches will get the full MD data, compared to our current SoA, where the reads for modules are mostly unused.

Also, consider reordering the members by size, at least partially, to avoid padding.
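The padding point can be illustrated with a small sketch. The member names and types below are hypothetical, since the actual members of `ModuleMDData` are not shown in this thread; only the ordering effect is the point. Declaring the widest members first lets the narrower ones pack into what would otherwise be alignment padding.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical members, declared in an arbitrary "as needed" order.
struct MDDataDeclaredOrder {
  bool isTilted;
  float moduleGap;
  std::int16_t subdet;
  float drdz;
  bool isEndcap;
  float miniTilt;
  std::int16_t layer;
};

// The same members, ordered from widest to narrowest.
struct MDDataSizeOrdered {
  float moduleGap;
  float drdz;
  float miniTilt;
  std::int16_t subdet;
  std::int16_t layer;
  bool isTilted;
  bool isEndcap;
};

// Bytes actually carrying data: three floats, two 16-bit ints, two bools.
constexpr std::size_t kPayloadBytes =
    3 * sizeof(float) + 2 * sizeof(std::int16_t) + 2 * sizeof(bool);

// Padding the compiler added to a struct of the given total size.
constexpr std::size_t paddingBytes(std::size_t structSize) { return structSize - kPayloadBytes; }

// Exact sizes are ABI-dependent; the ordered layout should never be larger.
static_assert(sizeof(MDDataSizeOrdered) <= sizeof(MDDataDeclaredOrder),
              "size-ordered layout is expected to need no more padding");
```

On a typical x86-64 Linux ABI this comes out as 28 bytes for the declared order and 20 bytes for the size-ordered one, so an array of such structs (the AoS layout suggested above) fits more modules per cache line.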

GNiendorf force-pushed the cpu_speedups_hoist branch from f5fbe61 to 83f2297 Compare March 18, 2026 17:30

GNiendorf marked this pull request as ready for review March 18, 2026 17:37

slava77 reviewed Mar 19, 2026

GNiendorf force-pushed the cpu_speedups_hoist branch 2 times, most recently from 9aad224 to 727bac8 Compare March 19, 2026 20:50

remove redundant memory loads

2375562

GNiendorf force-pushed the cpu_speedups_hoist branch from 727bac8 to 2375562 Compare March 19, 2026 20:59

GNiendorf force-pushed the cpu_speedups_hoist branch from a2305fb to df0990a Compare March 19, 2026 23:24

GNiendorf changed the title ~~Remove Redundant Memory Loads~~ CPU Optimizations Mar 19, 2026

GNiendorf changed the title ~~CPU Optimizations~~ LST CPU Speedups Mar 19, 2026

GNiendorf force-pushed the cpu_speedups_hoist branch from df0990a to 6ba26d8 Compare March 20, 2026 01:17

pre-checks, exact trig simplifications, and additional early exits

912b6d9

GNiendorf force-pushed the cpu_speedups_hoist branch from 6ba26d8 to 912b6d9 Compare March 20, 2026 09:45
