LST followups: better work divisions, concrete kernel dimension, some cleanup and fixes #47084

Open
wants to merge 6 commits into
base: master
from

ariostas
Contributor

This PR addresses some of the LST followups that we have listed in #46746.

Here is the list of fixes/changes:

  • Better work division: we switched to using cms::alpakatools::make_workdiv (instead of our custom createWorkDiv) and we now use cms::alpakatools::uniform_elements for kernel loops.
  • We switched to explicitly specifying kernel dimensions instead of using templated types.
  • Started removal of kVerticalModuleSlope (previously named lst_INF). We're doing this in two steps instead of one since the data files also need to be updated. We ensure a smooth transition by first supporting both options and later removing the legacy one.
  • We fixed some issues with our includes and with an overflow that was sometimes happening.
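
As a rough illustration of the first bullet, the sketch below emulates in plain C++ the kind of grid-strided loop that a range helper such as uniform_elements is understood to provide on GPU back ends: each thread starts at its global index and advances by the total number of threads. It does not use alpaka or cms::alpakatools; WorkDiv1D, elementsForThread and coverage are hypothetical names introduced only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 1D work division: number of blocks and threads per block.
struct WorkDiv1D {
  std::size_t blocks;
  std::size_t threadsPerBlock;
};

// Indices visited by one (block, thread) pair in a grid-strided loop over [0, extent).
std::vector<std::size_t> elementsForThread(WorkDiv1D div,
                                           std::size_t block,
                                           std::size_t thread,
                                           std::size_t extent) {
  assert(block < div.blocks && thread < div.threadsPerBlock);
  std::size_t const stride = div.blocks * div.threadsPerBlock;  // total threads in the grid
  std::vector<std::size_t> visited;
  for (std::size_t i = block * div.threadsPerBlock + thread; i < extent; i += stride) {
    visited.push_back(i);
  }
  return visited;
}

// Number of times each element is visited when every thread runs the loop.
std::vector<int> coverage(WorkDiv1D div, std::size_t extent) {
  std::vector<int> hits(extent, 0);
  for (std::size_t b = 0; b < div.blocks; ++b) {
    for (std::size_t t = 0; t < div.threadsPerBlock; ++t) {
      for (std::size_t i : elementsForThread(div, b, t, extent)) {
        ++hits[i];
      }
    }
  }
  return hits;
}
```

With a loop of this shape, every element is visited exactly once even when the extent is not a multiple of the grid size, so the kernel body needs no separate bounds check.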

c.c. @slava77 @VourMa

@cmsbuild
Contributor

cmsbuild commented Jan 10, 2025

cms-bot internal usage

@cmsbuild
Contributor

A new Pull Request was created by @ariostas for master.

It involves the following packages:

  • RecoTracker/LSTCore (reconstruction)

@cmsbuild, @jfernan2, @mandrenguyen can you please review it and eventually sign? Thanks.
@GiacomoSguazzoni, @VinInn, @VourMa, @dgulhan, @felicepantaleo, @gpetruc, @missirol, @mmusich, @mtosi, @rovere this is something you requested to watch as well.
@antoniovilela, @mandrenguyen, @rappoccio, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@ariostas
Contributor Author

Tagging @fwyzard since most (if not all) of the comments addressed were his

@slava77
Contributor

slava77 commented Jan 10, 2025

test parameters:

  • enable_tests = gpu
  • workflows_gpu = 29634.704,29834.704
  • workflows = 29634.703,29834.703
  • relvals_opt = -w upgrade,standard
  • relvals_opt_gpu = -w upgrade,standard

@slava77
Contributor

slava77 commented Jan 10, 2025

@cmsbuild please test

@cmsbuild
Contributor

-1

Failed Tests: UnitTests RelVals-GPU
Size: This PR adds an extra 104KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-1eb2fd/43723/summary.html
COMMIT: 1a27b2a
CMSSW: CMSSW_15_0_X_2025-01-10-1100/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/47084/43723/install.sh to create a dev area with all the needed externals and cmssw changes.

Unit Tests

I found 1 errors in the following unit tests:

---> test test-das-selected-lumis had ERRORS

RelVals-GPU

  • 29834.704: 29834.704_TTbar_14TeV+Run4D110PU_lstOnGPUIters01TrackingOnly/step3_TTbar_14TeV+Run4D110PU_lstOnGPUIters01TrackingOnly.log

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 8 differences found in the comparisons
  • DQMHistoTests: Total files compared: 52
  • DQMHistoTests: Total histograms compared: 3996179
  • DQMHistoTests: Total failures: 64
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3996095
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 51 files compared)
  • Checked 226 log files, 195 edm output root files, 52 DQM output files
  • TriggerResults: no differences found

@slava77
Contributor

slava77 commented Jan 10, 2025

29834.704: 29834.704_TTbar_14TeV+Run4D110PU_lstOnGPUIters01TrackingOnly/step3_TTbar_14TeV+Run4D110PU_lstOnGPUIters01TrackingOnly.log

there are a bunch of errors like

alpaka/event/EventUniformCudaHipRt.hpp(66) 
'TApi::eventDestroy(m_UniformCudaHipEvent)' returned error  : 
'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!

the same workflow step3 in the baseline ran OK. So, the crash seems related to this PR.

@jfernan2
Contributor

assign heterogeneous

@cmsbuild
Contributor

New categories assigned: heterogeneous

@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

RecoTracker/LSTCore/interface/Circle.h (outdated, resolved)
RecoTracker/LSTCore/interface/alpaka/Common.h (outdated, resolved)
RecoTracker/LSTCore/src/alpaka/PixelTriplet.h (outdated, resolved)
@cmsbuild
Contributor

Pull request #47084 was updated. @cmsbuild, @fwyzard, @jfernan2, @makortel, @mandrenguyen can you please check and sign again.

@slava77
Contributor

slava77 commented Jan 17, 2025

@cmsbuild please test

@cmsbuild
Contributor

+1

Size: This PR adds an extra 100KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-1eb2fd/43830/summary.html
COMMIT: baa91b3
CMSSW: CMSSW_15_0_X_2025-01-17-1100/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/47084/43830/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

GPU Comparison Summary

Summary:

@jfernan2
Contributor

+1

@slava77
Contributor

slava77 commented Jan 23, 2025

@cms-sw/heterogeneous-l2
(6 days after the last update resolving available comments)
please clarify on the status of your review or the expected signoff time.
Thank you.

@makortel
Contributor

Looks ok to me. @fwyzard Are you planning to take a look, or shall we just sign?

@fwyzard
Contributor

fwyzard commented Jan 27, 2025

I can have a look in the coming days, but if this is urgent for any reason go ahead and sign it, and I will still have a look after the fact.

@fwyzard
Contributor

fwyzard commented Jan 27, 2025

hold

> @cms-sw/heterogeneous-l2
> (6 days after the last update resolving available comments)
> please clarify on the status of your review or the expected signoff time.
> Thank you.

Actually, you know what ?
I will review it when I have the time.

@cmsbuild
Contributor

Pull request has been put on hold by @fwyzard
They need to issue an unhold command to remove the hold state or L1 can unhold it for all

@cmsbuild cmsbuild added the hold label Jan 27, 2025
- Vec3D const blocksPerGrid_crossCleanpT3{1, 4, 20};
- WorkDiv3D const crossCleanpT3_workDiv =
-     createWorkDiv(blocksPerGrid_crossCleanpT3, threadsPerBlock_crossCleanpT3, elementsPerThread);
+ auto const crossCleanpT3_workDiv = cms::alpakatools::make_workdiv<Acc2D>({20, 4}, {64, 16});
Contributor

Here the X and Y values in the ranges are inverted with respect to before - is it intended ?

Contributor Author

Yeah, there were a few places where I flipped the order so that the loops are nested in the recommended order.
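
To make "inverted" concrete, the sketch below assumes that the vectors in this exchange list their components from slowest to fastest, so that the last component is the one that varies fastest. Under that assumption the old blocks-per-grid {1, 4, 20} and the new {20, 4} describe the same 80 blocks, and only which extent sits on the fastest index changes. linearIndex and totalSize are hypothetical helpers in plain C++, not part of alpaka or cms::alpakatools.

```cpp
#include <array>
#include <cstddef>

// Linearize an N-dimensional index with the LAST component varying fastest
// (the ordering convention assumed for this illustration).
template <std::size_t N>
constexpr std::size_t linearIndex(std::array<std::size_t, N> const& extent,
                                  std::array<std::size_t, N> const& index) {
  std::size_t linear = 0;
  for (std::size_t d = 0; d < N; ++d) {
    linear = linear * extent[d] + index[d];
  }
  return linear;
}

// Total number of entries described by an extent.
template <std::size_t N>
constexpr std::size_t totalSize(std::array<std::size_t, N> const& extent) {
  std::size_t total = 1;
  for (std::size_t d = 0; d < N; ++d) {
    total *= extent[d];
  }
  return total;
}
```

For example, stepping the second-to-last index by one moves 20 linear positions in the old layout but only 4 in the new one, which is the kind of change in loop nesting the reply refers to.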

- Vec3D const blocksPerGrid_crossCleanpLS{1, 4, 20};
- WorkDiv3D const crossCleanpLS_workDiv =
-     createWorkDiv(blocksPerGrid_crossCleanpLS, threadsPerBlock_crossCleanpLS, elementsPerThread);
+ auto const crossCleanpLS_workDiv = cms::alpakatools::make_workdiv<Acc2D>({20, 4}, {32, 16});
Contributor

Also here (OK, so it's probably intended).

Contributor Author

Same as above

- if (slope ==
-     kVerticalModuleSlope)  // Designated for tilted module when the slope is infinity (module lying along y-axis)
+ if (slope == kVerticalModuleSlope ||
+     edm::isNotFinite(slope))  // Designated for tilted module when the slope is infinity (module lying along y-axis)
Contributor

@makortel do you know if edm::isFinite/edm::isNotFinite is guaranteed to work in device code?

Contributor

Theoretically they could presently work: the functions are constexpr and presently use a union for type punning (which is, strictly speaking, undefined behavior). I'd like to replace the union with std::bit_cast, which is constexpr, but on the other hand e.g. https://stackoverflow.com/a/78232359 kind of suggests using cuda::std::bit_cast on CUDA 12.8.

I see that in GCC 12 the std::bit_cast implementation is just a call to __builtin_bit_cast, and that e.g. in https://github.com/cms-sw/cmssw/blob/master/HeterogeneousCore/AlpakaInterface/interface/atomicMaxF.h we use edm::bit_cast (which just forwards to std::bit_cast or __builtin_bit_cast) only for the CPU implementation (I don't remember the exact reason for that though: whether edm::bit_cast didn't work in device code, or the intrinsics were "easier" on CUDA+HIP).

So perhaps for the long term it would be better to define Alpaka-specific functions (ideally Alpaka could provide a portable bit_cast).
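
For reference, a host-only sketch of the bit_cast-based approach discussed above might look like the following. It is not the edm implementation and says nothing about whether such code works in device code; bitCastSketch, isNotFiniteSketch and isVerticalSlopeSketch are hypothetical names, and the value given to kLegacyVerticalSlopeSketch is a placeholder rather than the constant used in LSTCore. The fallback to __builtin_bit_cast assumes a GCC- or Clang-compatible compiler.

```cpp
#include <cstdint>
#include <limits>
#if __cplusplus >= 202002L
#include <bit>
#endif

// Hypothetical stand-in for a bit_cast wrapper: forwards to std::bit_cast when
// C++20 is available, otherwise to the compiler builtin (assumed to exist).
template <typename To, typename From>
constexpr To bitCastSketch(From const& from) noexcept {
  static_assert(sizeof(To) == sizeof(From), "bit_cast requires equal sizes");
#if __cplusplus >= 202002L
  return std::bit_cast<To>(from);
#else
  return __builtin_bit_cast(To, from);
#endif
}

// A float is infinite or NaN exactly when all eight exponent bits are set.
constexpr bool isNotFiniteSketch(float x) noexcept {
  constexpr std::uint32_t kExponentMask = 0x7f800000u;
  return (bitCastSketch<std::uint32_t>(x) & kExponentMask) == kExponentMask;
}

// Placeholder sentinel value, assumed for illustration only.
constexpr float kLegacyVerticalSlopeSketch = 123456789.0f;

// Transitional check: accept both the legacy sentinel and a non-finite slope.
constexpr bool isVerticalSlopeSketch(float slope) noexcept {
  return slope == kLegacyVerticalSlopeSketch || isNotFiniteSketch(slope);
}
```

Because the check accepts either encoding, data files written with the legacy sentinel and files that store an infinite slope are treated alike, which is the two-step transition described in the PR description.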

Contributor Author

It currently does work, but I agree that it would be better to be more careful about it (in a separate PR?).

Contributor Author

On a related note, should I also reimplement std::distance to be safe?

@fwyzard
Contributor

fwyzard commented Jan 27, 2025

unhold

@cmsbuild cmsbuild removed the hold label Jan 27, 2025