
Add reduced memory runtime toggle for LST #242

Open
GNiendorf wants to merge 1 commit into master from min_mem

Conversation

@GNiendorf (Member) commented Mar 8, 2026

This PR adds a reduceMem runtime flag that enables exact buffer sizing for all LST objects (MD, LS, T3, T5, T4) in each counting kernel, reducing average memory usage from ~97 MB to ~33 MB per event. When the flag is off (default), behavior is identical to master with negligible timing overhead, as the new kernel launches are gated behind host-side if (reduceMem_) checks and use separate kernel structs. The flag is exposed as --reduce-mem in standalone and as a reduceMem config parameter in the CMSSW EDProducer.
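The gating pattern described above (a host-side runtime branch selecting between separate kernel structs, so the default path stays byte-identical to master) can be sketched in isolation. This is a minimal illustrative model, not the actual LST kernels; the struct and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the two sizing strategies. In the real PR these
// are separate device kernel structs; here they are plain host functors.
struct CountConnections {
  // Default path: matrix-based caps that over-allocate but need no precompute.
  std::size_t operator()(std::size_t nModules) const { return nModules * 512; }
};

struct CountConnectionsDense {
  // Reduced-memory path: exact sizing from per-module occupancy counts.
  std::size_t operator()(const std::vector<std::size_t>& occupancy) const {
    std::size_t total = 0;
    for (auto n : occupancy) total += n;
    return total;
  }
};

// Host-side toggle: when reduceMem is false, the legacy functor runs and the
// behavior (and buffer size) is identical to master.
std::size_t bufferSize(bool reduceMem, const std::vector<std::size_t>& occupancy) {
  if (reduceMem)
    return CountConnectionsDense{}(occupancy);
  return CountConnections{}(occupancy.size());
}
```

Because the branch happens once on the host rather than per-thread on the device, the off-by-default flag costs essentially nothing at runtime, which matches the timing comparison below.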

@GNiendorf (Member Author):

run-ci: all

@SegmentLinking SegmentLinking deleted a comment from github-actions bot Mar 8, 2026
@SegmentLinking SegmentLinking deleted a comment from github-actions bot Mar 8, 2026
@GNiendorf (Member Author):

run-ci: all


github-actions bot commented Mar 8, 2026

The PR was built and ran successfully in standalone mode on CPU. Here are some of the comparison plots.

[Comparison plots: efficiency, fake rate, and duplicate rate vs pT and eta]

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     28.1    325.2    243.6    134.3     51.5    675.0     11.0    114.3    118.4    190.7      0.1    1892.1    1189.0+/- 284.7     588.8   explicit[s=4] (target branch)
   avg     28.1    328.5    244.6    142.2     48.2    683.7     11.0    116.0    118.9    190.5      0.1    1911.7    1199.9+/- 287.2     593.2   explicit[s=4] (this PR)

@GNiendorf (Member Author) commented Mar 8, 2026

Nice, timing is unchanged with the runtime toggle turned off. No noticeable overhead.


github-actions bot commented Mar 8, 2026

The PR was built and ran successfully with CMSSW running on CPU. Here are some plots.

[Plots (OOTB All Tracks): efficiency and fake rate vs pT, eta, and phi]

The full set of validation and comparison plots can be found here.

@GNiendorf GNiendorf marked this pull request as ready for review March 9, 2026 09:49
}
};

// Reduced-memory version of CountMiniDoubletConnections: runs full segment algorithm

General question: how much code is replicated, and can it be avoided?
Perhaps a template can be made.

GNiendorf (Member Author) replied:

Yeah, one downside of making separate kernels to reduce overhead of the toggle is code duplication. I will see if I can reduce it.


slava77 commented Mar 12, 2026

for completeness (and since I think you already have the measurements from your talk) please post a table of before/after memory use as well as the CPU and GPU timing with this toggle enabled in the PR description; can even be just a link to the slides.

clustSizeCut_(static_cast<uint16_t>(config.getParameter<uint32_t>("clustSizeCut"))),
nopLSDupClean_(config.getParameter<bool>("nopLSDupClean")),
tcpLSTriplets_(config.getParameter<bool>("tcpLSTriplets")),
reduceMem_(config.getParameter<bool>("reduceMem")),

Something like fullPrecomputeMemSlots, or similarly expressive (like reduceMemByFullPrecompute). Just "reduceMem" leaves it unclear why it would ever be "false".
Adding a comment in fillDescriptions can be useful as well.

bool no_pls_dupclean,
bool tc_pls_triplets);
bool tc_pls_triplets,
bool reduceMem = false);

Suggested change
-  bool reduceMem = false);
+  bool reduceMem);

better be explicit

Comment on lines +203 to +209
auto dst_view_miniDoubletModuleOccupancy =
cms::alpakatools::make_device_view(queue_, rangesOccupancy.miniDoubletModuleOccupancy()[nLowerModules_]);
alpaka::memcpy(queue_, dst_view_miniDoubletModuleOccupancy, pixelMaxMDs_buf_h);

auto dst_view_miniDoubletModuleOccupancyPix =
cms::alpakatools::make_device_view(queue_, rangesOccupancy.miniDoubletModuleOccupancy()[pixelModuleIndex_]);
alpaka::memcpy(queue_, dst_view_miniDoubletModuleOccupancyPix, pixelMaxMDs_buf_h);

Isn't the first copy redundant? That is, nLowerModules_ is equal to pixelModuleIndex_; IIRC the latter was introduced so that its name states what the index is, rather than having to remember what nLowerModules_ is supposed to mean.

Comment on lines +234 to +239
auto dst_view_miniDoubletModuleOccupancy =
cms::alpakatools::make_device_view(queue_, rangesOccupancy.miniDoubletModuleOccupancy()[nLowerModules_]);
alpaka::memcpy(queue_, dst_view_miniDoubletModuleOccupancy, pixelMaxMDs_buf_h);

auto dst_view_miniDoubletModuleOccupancyPix =
cms::alpakatools::make_device_view(queue_, rangesOccupancy.miniDoubletModuleOccupancy()[pixelModuleIndex_]);

isn't the first redundant?

Comment on lines +211 to +213
constexpr int threadsPerBlockY = 16;
auto const countMiniDoublets_workDiv =
cms::alpakatools::make_workdiv<Acc2D>({nLowerModules_ / threadsPerBlockY, 1}, {threadsPerBlockY, 32});

if these have to match another kernel call where creation is done without precompute, perhaps add a comment

Comment on lines +327 to +345
if (reduceMem_) {
alpaka::exec<Acc3D>(queue_,
countMDConn_wd,
CountMiniDoubletConnectionsDense{},
modules_.const_view().modules(),
miniDoubletsDC_->view().miniDoublets(),
miniDoubletsDC_->const_view().miniDoubletsOccupancy(),
rangesDC_->const_view(),
ptCut_);
} else {
alpaka::exec<Acc3D>(queue_,
countMDConn_wd,
CountMiniDoubletConnections{},
modules_.const_view().modules(),
miniDoubletsDC_->view().miniDoublets(),
miniDoubletsDC_->const_view().miniDoubletsOccupancy(),
rangesDC_->const_view(),
ptCut_);
}

Suggested change
if (reduceMem_) {
alpaka::exec<Acc3D>(queue_,
countMDConn_wd,
CountMiniDoubletConnectionsDense{},
modules_.const_view().modules(),
miniDoubletsDC_->view().miniDoublets(),
miniDoubletsDC_->const_view().miniDoubletsOccupancy(),
rangesDC_->const_view(),
ptCut_);
} else {
alpaka::exec<Acc3D>(queue_,
countMDConn_wd,
CountMiniDoubletConnections{},
modules_.const_view().modules(),
miniDoubletsDC_->view().miniDoublets(),
miniDoubletsDC_->const_view().miniDoubletsOccupancy(),
rangesDC_->const_view(),
ptCut_);
}
auto executeCountMDConn = [&](auto kernel) {
alpaka::exec<Acc3D>(queue_,
countMDConn_wd,
kernel,
modules_.const_view().modules(),
miniDoubletsDC_->view().miniDoublets(),
miniDoubletsDC_->const_view().miniDoubletsOccupancy(),
rangesDC_->const_view(),
ptCut_);
};
if (reduceMem_) {
executeCountMDConn(CountMiniDoubletConnectionsDense{});
} else {
executeCountMDConn(CountMiniDoubletConnections{});
}

I got this from Copilot; I didn't try to compile it.
Following up on an earlier comment: if the kernels themselves have significant similarity, perhaps template them with a bool flag.
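The bool-template idea suggested above can be sketched with `if constexpr`, so only the lines that actually differ between the two kernels are guarded and everything else is written once. A minimal standalone model, with illustrative names rather than the actual LST kernel code:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One functor templated on the sizing strategy; `if constexpr` removes the
// untaken branch at compile time, so each instantiation is as tight as a
// hand-written kernel. Names here are hypothetical.
template <bool Dense>
struct CountMiniDoubletConnectionsT {
  int operator()(const std::vector<int>& occupancy, int cap) const {
    int count = 0;
    for (int n : occupancy) {
      if constexpr (Dense) {
        count += n;                  // reduced-memory path: exact counts
      } else {
        count += std::min(n, cap);   // default path: matrix-based cap
      }
    }
    return count;
  }
};

// The host-side dispatch collapses to a single runtime branch over the two
// instantiations, so the launch sites are no longer duplicated.
int countConnections(bool reduceMem, const std::vector<int>& occ, int cap) {
  return reduceMem ? CountMiniDoubletConnectionsT<true>{}(occ, cap)
                   : CountMiniDoubletConnectionsT<false>{}(occ, cap);
}
```

Combined with a dispatch lambda like the one suggested above, this would remove both the duplicated kernel bodies and the duplicated `alpaka::exec` argument lists.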

// Reduced-memory version of CreateMDArrayRangesGPU: reads pre-computed exact counts
// from CountMiniDoublets instead of using matrix-based caps.
// Only launched when reduceMem is enabled.
struct CreateMDArrayRangesReducedMem {

Is the Dense suffix going to be appropriate here as well? Just trying to reduce the set of naming patterns.

Comment on lines +1775 to +1786
#ifdef WARNINGS
printf("Quintuplet excess alert! Module index = %d, Occupancy = %d\n",
lowerModule1,
totOccupancyQuintuplets);
#endif
} else {
int quintupletModuleIndex = alpaka::atomicAdd(
acc, &quintupletsOccupancy.nQuintuplets()[lowerModule1], 1u, alpaka::hierarchy::Threads{});
if (ranges.quintupletModuleIndices()[lowerModule1] == -1) {
#ifdef WARNINGS
printf("Quintuplets : no memory for module at module index = %d\n", lowerModule1);
#endif

(Perhaps costly.) Aren't these "warnings" supposed to be asserts in the Dense case? Or did I miss the logic?
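The distinction raised here can be sketched in isolation: with exact precomputed sizes, hitting the capacity in the Dense path would indicate a sizing bug, so an assert arguably fits better than a printf compiled only under WARNINGS. A hypothetical model (names illustrative, not the actual LST code):

```cpp
#include <cassert>
#include <cstdio>

// Tries to record a candidate into a fixed-capacity per-module slot.
// Default path: capacity is a heuristic cap, so overflow is expected
// occasionally and only merits an optional warning. Dense path: capacity was
// precomputed exactly, so overflow can only be a logic error.
template <bool Dense>
bool tryRecordQuintuplet(int occupancy, int capacity) {
  if (occupancy >= capacity) {
    if constexpr (Dense) {
      assert(false && "Dense sizing is exact; overflow here is a logic error");
    } else {
#ifdef WARNINGS
      std::printf("Quintuplet excess alert! Occupancy = %d\n", occupancy);
#endif
    }
    return false;  // drop the candidate, as the default path does today
  }
  return true;
}
```

In release builds the assert compiles away, so the Dense path would carry no extra cost while still catching sizing bugs in debug runs.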
