
Add reduced memory runtime toggle for LST #242

Open
GNiendorf wants to merge 1 commit into master from min_mem

Conversation

@GNiendorf (Member) commented Mar 8, 2026

This PR adds a reduceMem runtime flag that enables exact buffer sizing for all LST objects (MD, LS, T3, T5, T4) in each counting kernel, reducing average memory usage from ~97 MB to ~33 MB per event. When the flag is off (default), behavior is identical to master with negligible timing overhead, as the new kernel launches are gated behind host-side if (reduceMem_) checks and use separate kernel structs. The flag is exposed as --reduce-mem in standalone and as a reduceMem config parameter in the CMSSW EDProducer.
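The gating pattern described above (a host-side runtime branch selecting between separate kernel structs, so the default path stays byte-identical to master) can be sketched in isolation. This is a minimal illustrative model, not the actual LST kernels; the struct and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the two sizing strategies. In the real PR these
// are separate device kernel structs; here they are plain host functors.
struct CountConnections {
  // Default path: matrix-based caps that over-allocate but need no precompute.
  std::size_t operator()(std::size_t nModules) const { return nModules * 512; }
};

struct CountConnectionsDense {
  // Reduced-memory path: exact sizing from per-module occupancy counts.
  std::size_t operator()(const std::vector<std::size_t>& occupancy) const {
    std::size_t total = 0;
    for (auto n : occupancy) total += n;
    return total;
  }
};

// Host-side toggle: when reduceMem is false, the legacy functor runs and the
// behavior (and buffer size) is identical to master.
std::size_t bufferSize(bool reduceMem, const std::vector<std::size_t>& occupancy) {
  if (reduceMem)
    return CountConnectionsDense{}(occupancy);
  return CountConnections{}(occupancy.size());
}
```

Because the branch happens once on the host rather than per-thread on the device, the off-by-default flag costs essentially nothing at runtime, which matches the timing comparison below.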

@GNiendorf (Member Author):

run-ci: all

@SegmentLinking SegmentLinking deleted a comment from github-actions bot Mar 8, 2026
@SegmentLinking SegmentLinking deleted a comment from github-actions bot Mar 8, 2026
@GNiendorf (Member Author):

run-ci: all


github-actions bot commented Mar 8, 2026

The PR was built and ran successfully in standalone mode on CPU. Here are some of the comparison plots.

[Comparison plots: efficiency, fake rate, and duplicate rate vs pT and eta]

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     28.1    325.2    243.6    134.3     51.5    675.0     11.0    114.3    118.4    190.7      0.1    1892.1    1189.0+/- 284.7     588.8   explicit[s=4] (target branch)
   avg     28.1    328.5    244.6    142.2     48.2    683.7     11.0    116.0    118.9    190.5      0.1    1911.7    1199.9+/- 287.2     593.2   explicit[s=4] (this PR)

@GNiendorf (Member Author) commented Mar 8, 2026

Nice, timing is unchanged with the runtime toggle turned off. No noticeable overhead.


github-actions bot commented Mar 8, 2026

The PR was built and ran successfully with CMSSW running on CPU. Here are some plots.

[Plots (OOTB All Tracks): efficiency and fake rate vs pT, eta, and phi]

The full set of validation and comparison plots can be found here.

@GNiendorf GNiendorf marked this pull request as ready for review March 9, 2026 09:49
}
};

// Reduced-memory version of CountMiniDoubletConnections: runs full segment algorithm

General question: how much code is replicated, and can it be avoided?
Perhaps a template can be made.

GNiendorf (Member Author) replied:

Yeah, one downside of making separate kernels to reduce overhead of the toggle is code duplication. I will see if I can reduce it.


slava77 commented Mar 12, 2026

for completeness (and since I think you already have the measurements from your talk) please post a table of before/after memory use as well as the CPU and GPU timing with this toggle enabled in the PR description; can even be just a link to the slides.

clustSizeCut_(static_cast<uint16_t>(config.getParameter<uint32_t>("clustSizeCut"))),
nopLSDupClean_(config.getParameter<bool>("nopLSDupClean")),
tcpLSTriplets_(config.getParameter<bool>("tcpLSTriplets")),
reduceMem_(config.getParameter<bool>("reduceMem")),

Something like fullPrecomputeMemSlots, or similarly expressive (like reduceMemByFullPrecompute). Just "reduceMem" leaves it unclear why it would ever be "false".
Adding a comment in fillDescriptions can be useful as well.

bool no_pls_dupclean,
bool tc_pls_triplets);
bool tc_pls_triplets,
bool reduceMem = false);

Suggested change
-  bool reduceMem = false);
+  bool reduceMem);

better be explicit

Comment on lines +203 to +209
auto dst_view_miniDoubletModuleOccupancy =
cms::alpakatools::make_device_view(queue_, rangesOccupancy.miniDoubletModuleOccupancy()[nLowerModules_]);
alpaka::memcpy(queue_, dst_view_miniDoubletModuleOccupancy, pixelMaxMDs_buf_h);

auto dst_view_miniDoubletModuleOccupancyPix =
cms::alpakatools::make_device_view(queue_, rangesOccupancy.miniDoubletModuleOccupancy()[pixelModuleIndex_]);
alpaka::memcpy(queue_, dst_view_miniDoubletModuleOccupancyPix, pixelMaxMDs_buf_h);

Isn't the first copy redundant? That is, nLowerModules_ is equal to pixelModuleIndex_; IIRC the latter was introduced so that its name states what the index is, rather than having to remember what nLowerModules_ is supposed to mean.

Comment on lines +234 to +239
auto dst_view_miniDoubletModuleOccupancy =
cms::alpakatools::make_device_view(queue_, rangesOccupancy.miniDoubletModuleOccupancy()[nLowerModules_]);
alpaka::memcpy(queue_, dst_view_miniDoubletModuleOccupancy, pixelMaxMDs_buf_h);

auto dst_view_miniDoubletModuleOccupancyPix =
cms::alpakatools::make_device_view(queue_, rangesOccupancy.miniDoubletModuleOccupancy()[pixelModuleIndex_]);

isn't the first redundant?

Comment on lines +211 to +213
constexpr int threadsPerBlockY = 16;
auto const countMiniDoublets_workDiv =
cms::alpakatools::make_workdiv<Acc2D>({nLowerModules_ / threadsPerBlockY, 1}, {threadsPerBlockY, 32});

if these have to match another kernel call where creation is done without precompute, perhaps add a comment

Comment on lines +327 to +345
if (reduceMem_) {
alpaka::exec<Acc3D>(queue_,
countMDConn_wd,
CountMiniDoubletConnectionsDense{},
modules_.const_view().modules(),
miniDoubletsDC_->view().miniDoublets(),
miniDoubletsDC_->const_view().miniDoubletsOccupancy(),
rangesDC_->const_view(),
ptCut_);
} else {
alpaka::exec<Acc3D>(queue_,
countMDConn_wd,
CountMiniDoubletConnections{},
modules_.const_view().modules(),
miniDoubletsDC_->view().miniDoublets(),
miniDoubletsDC_->const_view().miniDoubletsOccupancy(),
rangesDC_->const_view(),
ptCut_);
}

Suggested change
if (reduceMem_) {
alpaka::exec<Acc3D>(queue_,
countMDConn_wd,
CountMiniDoubletConnectionsDense{},
modules_.const_view().modules(),
miniDoubletsDC_->view().miniDoublets(),
miniDoubletsDC_->const_view().miniDoubletsOccupancy(),
rangesDC_->const_view(),
ptCut_);
} else {
alpaka::exec<Acc3D>(queue_,
countMDConn_wd,
CountMiniDoubletConnections{},
modules_.const_view().modules(),
miniDoubletsDC_->view().miniDoublets(),
miniDoubletsDC_->const_view().miniDoubletsOccupancy(),
rangesDC_->const_view(),
ptCut_);
}
auto executeCountMDConn = [&](auto kernel) {
alpaka::exec<Acc3D>(queue_,
countMDConn_wd,
kernel,
modules_.const_view().modules(),
miniDoubletsDC_->view().miniDoublets(),
miniDoubletsDC_->const_view().miniDoubletsOccupancy(),
rangesDC_->const_view(),
ptCut_);
};
if (reduceMem_) {
executeCountMDConn(CountMiniDoubletConnectionsDense{});
} else {
executeCountMDConn(CountMiniDoubletConnections{});
}

I got this from Copilot; I didn't try to compile it.
Following up on an earlier comment: if the kernels themselves have significant similarity, perhaps template them with a bool flag.
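The bool-template idea suggested above can be sketched with `if constexpr`, so only the lines that actually differ between the two kernels are guarded and everything else is written once. A minimal standalone model, with illustrative names rather than the actual LST kernel code:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One functor templated on the sizing strategy; `if constexpr` removes the
// untaken branch at compile time, so each instantiation is as tight as a
// hand-written kernel. Names here are hypothetical.
template <bool Dense>
struct CountMiniDoubletConnectionsT {
  int operator()(const std::vector<int>& occupancy, int cap) const {
    int count = 0;
    for (int n : occupancy) {
      if constexpr (Dense) {
        count += n;                  // reduced-memory path: exact counts
      } else {
        count += std::min(n, cap);   // default path: matrix-based cap
      }
    }
    return count;
  }
};

// The host-side dispatch collapses to a single runtime branch over the two
// instantiations, so the launch sites are no longer duplicated.
int countConnections(bool reduceMem, const std::vector<int>& occ, int cap) {
  return reduceMem ? CountMiniDoubletConnectionsT<true>{}(occ, cap)
                   : CountMiniDoubletConnectionsT<false>{}(occ, cap);
}
```

Combined with a dispatch lambda like the one suggested above, this would remove both the duplicated kernel bodies and the duplicated `alpaka::exec` argument lists.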

// Reduced-memory version of CreateMDArrayRangesGPU: reads pre-computed exact counts
// from CountMiniDoublets instead of using matrix-based caps.
// Only launched when reduceMem is enabled.
struct CreateMDArrayRangesReducedMem {

Is the Dense suffix going to be appropriate here as well? Just trying to reduce the set of naming patterns.

Comment on lines +1775 to +1786
#ifdef WARNINGS
printf("Quintuplet excess alert! Module index = %d, Occupancy = %d\n",
lowerModule1,
totOccupancyQuintuplets);
#endif
} else {
int quintupletModuleIndex = alpaka::atomicAdd(
acc, &quintupletsOccupancy.nQuintuplets()[lowerModule1], 1u, alpaka::hierarchy::Threads{});
if (ranges.quintupletModuleIndices()[lowerModule1] == -1) {
#ifdef WARNINGS
printf("Quintuplets : no memory for module at module index = %d\n", lowerModule1);
#endif

(Perhaps costly.) Aren't these "warnings" supposed to be asserts in the Dense case? Or did I miss the logic?
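The distinction raised here can be sketched in isolation: with exact precomputed sizes, hitting the capacity in the Dense path would indicate a sizing bug, so an assert arguably fits better than a printf compiled only under WARNINGS. A hypothetical model (names illustrative, not the actual LST code):

```cpp
#include <cassert>
#include <cstdio>

// Tries to record a candidate into a fixed-capacity per-module slot.
// Default path: capacity is a heuristic cap, so overflow is expected
// occasionally and only merits an optional warning. Dense path: capacity was
// precomputed exactly, so overflow can only be a logic error.
template <bool Dense>
bool tryRecordQuintuplet(int occupancy, int capacity) {
  if (occupancy >= capacity) {
    if constexpr (Dense) {
      assert(false && "Dense sizing is exact; overflow here is a logic error");
    } else {
#ifdef WARNINGS
      std::printf("Quintuplet excess alert! Occupancy = %d\n", occupancy);
#endif
    }
    return false;  // drop the candidate, as the default path does today
  }
  return true;
}
```

In release builds the assert compiles away, so the Dense path would carry no extra cost while still catching sizing bugs in debug runs.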
