Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

potential speedup in HLT: AXOL1TLCondition::evaluateCondition consider caching the model? #46740

Open
slava77 opened this issue Nov 19, 2024 · 29 comments

Comments

@slava77
Copy link
Contributor

slava77 commented Nov 19, 2024

Looking at a profile of HLT in 14_1_X with callgrind, on MC running only MC_ReducedIterativeTracking_v22, I see that 78% of L1TGlobalProducer::produce is spent in l1t::AXOL1TLCondition::evaluateCondition
https://github.com/cms-sw/cmssw/blob/CMSSW_14_1_0_pre5/L1Trigger/L1TGlobal/src/AXOL1TLCondition.cc#L100

In my test L1TGlobalProducer::produce takes 9.7% of the time; in the full menu it's apparently around 2.5% 0.9% (updated to 0.9%, see notes below)

Of all the time spent in l1t::AXOL1TLCondition::evaluateCondition

  • hls4mlEmulator::ModelLoader::load_model() is 54%
  • hls4mlEmulator::ModelLoader::~ModelLoader() is 30%
  • GTADModel_emulator_v4::predict() is 15%
  • the rest is around 1%

IIUC, the load and destruction of the model happens 10 times per event; I don't see any dependence on the current event variables.
Some kind of caching may be useful to get HLT to run a bit faster (seems like 1.5% (updated) 0.6% or so).

@cmsbuild
Copy link
Contributor

cmsbuild commented Nov 19, 2024

cms-bot internal usage

@cmsbuild
Copy link
Contributor

A new Issue was created by @slava77.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@mmusich
Copy link
Contributor

mmusich commented Nov 19, 2024

@aloeliger @artlbv FYI

@aloeliger
Copy link
Contributor

@quinnanm

@aloeliger
Copy link
Contributor

I can also take a look at this when I get some time, frankly I've been meaning to for a while.

@mmusich
Copy link
Contributor

mmusich commented Nov 19, 2024

In my test L1TGlobalProducer::produce takes 9.7% of the time; in the full menu it's apparently around 2.5%

in a more "realistic" setup it's more like sub-percent

Screenshot from 2024-11-19 22-26-13

(details from #45631 (comment))

@makortel
Copy link
Contributor

assign l1

@makortel
Copy link
Contributor

type performance-improvements

@cmsbuild
Copy link
Contributor

New categories assigned: l1

@aloeliger,@epalencia you have been requested to review this Pull request/Issue and eventually sign? Thanks

@slava77
Copy link
Contributor Author

slava77 commented Nov 19, 2024

in a more "realistic" setup it's more like sub-percent

My 2.5% comes from a more recent test (but on GRun menu) by @bdanzi
The odd part there is that e.g. in the first iteration in the first 6 jobs L1TGlobalProducer takes under 1%, but in the last two it's around 6%

@mmusich
Copy link
Contributor

mmusich commented Nov 19, 2024

My 2.5% comes from a more recent test (but on GRun menu) by @bdanzi

I'd rather trust more the manual measurement than what comes from the timing server, tbh.

@slava77
Copy link
Contributor Author

slava77 commented Nov 20, 2024

essentially all of the cost is in dlopen and dlclose calls.
https://github.com/cms-hls4ml/hls4mlEmulatorExtras/blob/v1.1.3/src/hls4ml/emulator.cc#L12-L22
create_model has orders of magnitude lower cost

perhaps just having the .so file loaded in the constructor or begin job is enough

@artlbv
Copy link
Contributor

artlbv commented Nov 21, 2024

IIUC, the load and destruction of the model happens 10 times per event; I don't see any dependence on the current event variables.

The "problem" is that the model is evaluated (and loaded) for every threshold and BX separately, even though it is the same model. If caching is possible that would help of course and in principle the model(s) are known at the beginning of the job as they are fixed in the L1 menu.

IMO it would be good to have some common approach to this loading of HLS4ML models within L1/CMSSW, as e.g. here it seems to be done differently:
https://github.com/cms-sw/cmssw/blob/CMSSW_14_2_X/L1Trigger/Phase2L1ParticleFlow/interface/JetId.h#L50-L53

@makortel
Copy link
Contributor

The "problem" is that the model is evaluated (and loaded) for every threshold and BX separately, even though it is the same model. If caching is possible that would help of course and in principle the model(s) are known at the beginning of the job as they are fixed in the L1 menu.

Given that the interface of hls4mlEmulator::Model is not const, the best that one can do is to call ModelLoader::load_model() once per module per stream (in constructor of edm::stream module, or using StreamCache with edm::global module, or in constructor of edm::one module).

This is how

unique_ptr<float[]> fDY_;
tensorflow::Session *sessionRef_;
std::shared_ptr<hls4mlEmulator::Model> modelRef_;
};

is used in
fBJetId_ = std::make_unique<JetId>(
cfg.getParameter<std::string>("NNInput"), cfg.getParameter<std::string>("NNOutput"), cache, fNParticles_);

(now if hls4mlEmulator::Model interface would be const and thread-efficient, we could avoid the per-stream copies of the hls4mlEmulator::Model)

@mmusich
Copy link
Contributor

mmusich commented Nov 22, 2024

assign hlt

  • to keep it on the radar

@cmsbuild
Copy link
Contributor

New categories assigned: hlt

@Martin-Grunewald,@mmusich you have been requested to review this Pull request/Issue and eventually sign? Thanks

@jpearkes
Copy link
Contributor

jpearkes commented Dec 10, 2024

Hi @slava77,

For these tests is the --timing flag being used in hltGetConfiguration? From tests we were running last year, we saw a big slowdown without using the --timing flag, e.g.
hltGetConfiguration /dev/CMSSW_13_3_0/GRun --data --output none --globaltag 132X_dataRun3_HLT_for2024TSGStudies_v1 --process jannicke_test1 --timing > hlt.py

I am attaching screenshots from a presentation here . The "Dec" results are when the --timing flag was not being used, the "New" results are when the --timing was being used. The difference in throughput was on the order of 50%.
Screenshot 2024-12-10 at 12 10 54 PM
Screenshot 2024-12-10 at 12 11 24 PM

@mmusich
Copy link
Contributor

mmusich commented Dec 10, 2024

@jpearkes

For these tests is the --timing flag being used in hltGetConfiguration?

As far as I understand these measurement were derived using the timing server, so the flag should have been used by default.
Irrespective of how the timing was run, there is an opportunity to make the code more efficient, see suggestion at #46740 (comment). Is it something that is going to picked up by the AXO team?

@jpearkes
Copy link
Contributor

@mmusich We were using the timing server for the tests too, and at least around this time last year it was not enabled by default.

Completely agreed that it is an opportunity to make the code more efficient. As after running the --timing flag the timing was no longer an issue the AXO deprioritized it. We could bump it up in priority if needed though. @quinnanm and @aloeliger are our main emulator contacts from both AXO and CICADA.

@missirol
Copy link
Contributor

missirol commented Jan 5, 2025

I gave it a try in https://github.com/missirol/cmssw/commits/devel_cmssw46740_v1, which includes two commits on top of CMSSW_15_0_0_pre1.

  • missirol@9dc135c moves several operations from L1TGlobalProducer::produce to L1TGlobalProducer::beginRun (these are operations that depend only on the L1T trigger menu, which (afaik) may only change during run transitions). In particular, this leads to initialising the map of "trigger conditions" (instances of classes inhering from ConditionEvaluation, one of them being AXOL1TLCondition) only once per run, as opposed to once per event. (this kind of update is unrelated to this issue per se)

  • missirol@4be82dd moves the initialisation of the hls4ml model in AXOL1TLCondition from its evaluateCondition method (used multiple times per event) to its constructor (called only a few times per run if missirol@9dc135c is used). missirol@4be82dd requires missirol/hls4mlEmulatorExtras@69f7d1e (an update of the hls4mlEmulatorExtras external), because the current methods of ModelLoader were not enough for me to get things to work properly, and since (as far as I understand) the Model and ModelLoader must have the same "lifetime" in order for things to work, it made sense to package them in a single class (called ModelWrapper in missirol/hls4mlEmulatorExtras@69f7d1e).

I tested this update with the script in [1] (running the L1-uGT emulator on roughly 50k events of 2024 pp data), and I got the outputs in [2] from the FastTimerService: it is not an accurate timing measurement, but it seems that these updates bring some speedup.

Did others have something else in mind ? If you think it's useful, I can put this in a PR (incl. missirol/hls4mlEmulatorExtras@69f7d1e, assuming that's an acceptable way to update that external).


[1]

import FWCore.ParameterSet.Config as cms

import FWCore.ParameterSet.VarParsing as vpo
opts = vpo.VarParsing('standard')
opts.setDefault('maxEvents', 1000)
opts.parseArguments()

process = cms.Process('TEST')

process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
process.options.wantSummary = True
process.maxEvents.input = opts.maxEvents

# Global Tag
from Configuration.AlCa.GlobalTag import GlobalTag
process.load('Configuration.StandardSequences.FrontierConditions_GlobalTag_cff')
process.GlobalTag = GlobalTag(process.GlobalTag, '141X_dataRun3_HLT_v2', '')

# Input source
process.source = cms.Source('PoolSource',
    fileNames = cms.untracked.vstring(
        'root://eoscms.cern.ch//eos/cms/store/data/Run2024I/EphemeralHLTPhysics7/RAW/v1/000/386/593/00000/00aa2e37-b442-4fa6-b087-73c54699784d.root',
        'root://eoscms.cern.ch//eos/cms/store/data/Run2024I/EphemeralHLTPhysics7/RAW/v1/000/386/593/00000/018e0570-6f5f-48db-ab22-3c156607f248.root',
        'root://eoscms.cern.ch//eos/cms/store/data/Run2024I/EphemeralHLTPhysics7/RAW/v1/000/386/593/00000/036cb164-49e7-490f-8a3b-c1f37f713deb.root',
        'root://eoscms.cern.ch//eos/cms/store/data/Run2024I/EphemeralHLTPhysics7/RAW/v1/000/386/593/00000/074e8d3f-d60f-46a4-8fbe-e16e86d2b693.root',
        'root://eoscms.cern.ch//eos/cms/store/data/Run2024I/EphemeralHLTPhysics7/RAW/v1/000/386/593/00000/078f24ee-6776-45b1-a437-5b58de809fb6.root',
        'root://eoscms.cern.ch//eos/cms/store/data/Run2024I/EphemeralHLTPhysics7/RAW/v1/000/386/807/00000/136403f7-a7fe-40c2-8af6-280f91f82976.root',
        'root://eoscms.cern.ch//eos/cms/store/data/Run2024I/EphemeralHLTPhysics7/RAW/v1/000/386/807/00000/247d49c3-5efe-4de1-8046-303e510b6a00.root',
        'root://eoscms.cern.ch//eos/cms/store/data/Run2024I/EphemeralHLTPhysics7/RAW/v1/000/386/807/00000/357f3289-56d5-413c-958f-ecc5844bd4ed.root',
        'root://eoscms.cern.ch//eos/cms/store/data/Run2024I/EphemeralHLTPhysics7/RAW/v1/000/386/807/00000/42371903-8e35-487a-a646-2ba330cf8cf3.root',
        'root://eoscms.cern.ch//eos/cms/store/data/Run2024I/EphemeralHLTPhysics7/RAW/v1/000/386/807/00000/46de66dd-f64c-4ee7-8553-1ebb2e08cc68.root',
    )
)

# EventSetup modules
process.GlobalParametersRcdSource = cms.ESSource('EmptyESSource',
    recordName = cms.string('L1TGlobalParametersRcd'),
    iovIsRunNotTime = cms.bool(True),
    firstValid = cms.vuint32(1)
)

process.GlobalParameters = cms.ESProducer('StableParametersTrivialProducer',
    # trigger decision
    NumberPhysTriggers = cms.uint32(512), # number of physics trigger algorithms
    # trigger objects
    NumberL1Muon = cms.uint32(8),    # muons
    NumberL1EGamma = cms.uint32(12), # e/gamma and isolated e/gamma objects
    NumberL1Jet = cms.uint32(12),    # jets
    NumberL1Tau = cms.uint32(12),    # taus
    # hardware
    NumberChips = cms.uint32(1),  # number of maximum chips defined in the xml file
    PinsOnChip = cms.uint32(512), # number of pins on the GTL condition chips
    # correspondence 'condition chip - GTL algorithm word' in the hardware
    # e.g.: chip 2: 0 - 95;  chip 1: 96 - 128 (191)
    OrderOfChip = cms.vint32(1),
)

# EventData modules
process.simGtExtFakeStage2Digis = cms.EDProducer('L1TExtCondProducer',
    bxFirst = cms.int32(-2),
    bxLast = cms.int32(2),
    setBptxAND = cms.bool(True),
    setBptxMinus = cms.bool(True),
    setBptxOR = cms.bool(True),
    setBptxPlus = cms.bool(True),
    tcdsRecordLabel = cms.InputTag('')
)

process.simGtStage2Digis = cms.EDProducer('L1TGlobalProducer',
    AlgoBlkInputTag = cms.InputTag(''),
    AlgorithmTriggersUnmasked = cms.bool(True),
    AlgorithmTriggersUnprescaled = cms.bool(True),
    EGammaInputTag = cms.InputTag('simCaloStage2Digis'),
    EtSumInputTag = cms.InputTag('simCaloStage2Digis'),
    ExtInputTag = cms.InputTag('simGtExtFakeStage2Digis'),
    JetInputTag = cms.InputTag('simCaloStage2Digis'),
    MuonInputTag = cms.InputTag('simGmtStage2Digis'),
    MuonShowerInputTag = cms.InputTag('simGmtShowerDigis'),
    TauInputTag = cms.InputTag('simCaloStage2Digis'),
    useMuonShowers = cms.bool(True),
    RequireMenuToMatchAlgoBlkInput = cms.bool(False),
)

# Sequence definition
process.l1tSequence = cms.Sequence( process.simGtExtFakeStage2Digis + process.simGtStage2Digis )

# Path definition
process.l1tPath = cms.Path( process.l1tSequence )

## Analyser of L1T-menu results
#process.l1tGlobalSummary = cms.EDAnalyzer( 'L1TGlobalSummary',
#    AlgInputTag = cms.InputTag( 'simGtStage2Digis' ),
#    ExtInputTag = cms.InputTag( 'simGtStage2Digis' ),
#    MinBx = cms.int32( 0 ),
#    MaxBx = cms.int32( 0 ),
#    DumpTrigResults = cms.bool( False ),
#    DumpRecord = cms.bool( False ),
#    DumpTrigSummary = cms.bool( True ),
#    ReadPrescalesFromFile = cms.bool( False ),
#    psFileName = cms.string( '' ),
#    psColumn = cms.int32( 0 )
#)
#
## EndPath definition
#process.l1tEndPath = cms.EndPath( process.l1tGlobalSummary )

# MessageLogger
process.load('FWCore.MessageService.MessageLogger_cfi')
process.MessageLogger.cerr.FwkReport.reportEvery = 1000
process.MessageLogger.L1TGlobalSummary = cms.untracked.PSet()
process.MessageLogger.FastReport = cms.untracked.PSet()

# FastTimerService
process.FastTimerService = cms.Service( "FastTimerService",
    printEventSummary = cms.untracked.bool( False ),
    printRunSummary = cms.untracked.bool( False ),
    printJobSummary = cms.untracked.bool( True ),
    writeJSONSummary = cms.untracked.bool( False ),
    jsonFileName = cms.untracked.string( "resources.json" ),
    enableDQM = cms.untracked.bool( False ),
    enableDQMbyModule = cms.untracked.bool( False ),
    enableDQMbyPath = cms.untracked.bool( False ),
    enableDQMbyLumiSection = cms.untracked.bool( False ),
    enableDQMbyProcesses = cms.untracked.bool( False ),
    enableDQMTransitions = cms.untracked.bool( False ),
    dqmTimeRange = cms.untracked.double( 2000.0 ),
    dqmTimeResolution = cms.untracked.double( 5.0 ),
    dqmMemoryRange = cms.untracked.double( 1000000.0 ),
    dqmMemoryResolution = cms.untracked.double( 5000.0 ),
    dqmPathTimeRange = cms.untracked.double( 100.0 ),
    dqmPathTimeResolution = cms.untracked.double( 0.5 ),
    dqmPathMemoryRange = cms.untracked.double( 1000000.0 ),
    dqmPathMemoryResolution = cms.untracked.double( 5000.0 ),
    dqmModuleTimeRange = cms.untracked.double( 40.0 ),
    dqmModuleTimeResolution = cms.untracked.double( 0.2 ),
    dqmModuleMemoryRange = cms.untracked.double( 100000.0 ),
    dqmModuleMemoryResolution = cms.untracked.double( 500.0 ),
    dqmLumiSectionsRange = cms.untracked.uint32( 2500 ),
    dqmPath = cms.untracked.string( "HLT/TimerService" ),
)

[2] Tested on lxplus8 (without full control over the machine, but at least I ran the test a few times and got consistent results). The L1-uGT emulator corresponds to the module simGtStage2Digis.

CMSSW_15_0_0_pre1

FastReport ---------------------------- Job Summary ----------------------------
FastReport   CPU time avg.      when run  Real time avg.      when run     Alloc. avg.      when run   Dealloc. avg.      when run  Modules
FastReport         0.0 ms         0.0 ms         0.1 ms         0.1 ms          +3 kB          +3 kB          -3 kB          -3 kB  source
FastReport         2.5 ms                        2.9 ms                       +457 kB                       -322 kB                 process TEST
FastReport         0.0 ms         0.0 ms         0.0 ms         0.0 ms          +1 kB          +1 kB          -1 kB          -1 kB    simGtExtFakeStage2Digis
FastReport         2.5 ms         2.5 ms         2.9 ms         2.9 ms        +455 kB        +455 kB        -321 kB        -321 kB    simGtStage2Digis
FastReport         0.0 ms         0.0 ms         0.0 ms         0.0 ms          +0 kB          +0 kB          +0 kB          +0 kB    TriggerResults
FastReport         0.0 ms         0.0 ms         0.0 ms         0.0 ms          +0 kB          +0 kB          +0 kB          +0 kB    l1tPath
FastReport         2.5 ms                        3.0 ms                       +461 kB                       -326 kB                 total
FastReport         0.0 ms                        0.0 ms                         +0 kB                         +0 kB                 eventsetup
FastReport         0.1 ms                        0.6 ms                        +41 kB                       -174 kB                 other
FastReport         0.0 ms                        0.0 ms                         +0 kB                         +0 kB                 idle

CMSSW_15_0_0_pre1 + (missirol@9dc135c + missirol@4be82dd) + missirol/hls4mlEmulatorExtras@69f7d1e

FastReport ---------------------------- Job Summary ----------------------------
FastReport   CPU time avg.      when run  Real time avg.      when run     Alloc. avg.      when run   Dealloc. avg.      when run  Modules
FastReport         0.0 ms         0.0 ms         0.1 ms         0.1 ms          +3 kB          +3 kB          -3 kB          -3 kB  source
FastReport         0.4 ms                        0.4 ms                       +409 kB                       -273 kB                 process TEST
FastReport         0.0 ms         0.0 ms         0.0 ms         0.0 ms          +1 kB          +1 kB          -1 kB          -1 kB    simGtExtFakeStage2Digis
FastReport         0.4 ms         0.4 ms         0.4 ms         0.4 ms        +407 kB        +407 kB        -272 kB        -272 kB    simGtStage2Digis
FastReport         0.0 ms         0.0 ms         0.0 ms         0.0 ms          +0 kB          +0 kB          +0 kB          +0 kB    TriggerResults
FastReport         0.0 ms         0.0 ms         0.0 ms         0.0 ms          +0 kB          +0 kB          +0 kB          +0 kB    l1tPath
FastReport         0.4 ms                        0.5 ms                       +413 kB                       -277 kB                 total
FastReport         0.0 ms                        0.0 ms                         +0 kB                         +0 kB                 eventsetup
FastReport         0.1 ms                        0.5 ms                        +41 kB                       -174 kB                 other
FastReport         0.0 ms                        0.0 ms                         +0 kB                         +0 kB                 idle

@missirol
Copy link
Contributor

missirol commented Jan 5, 2025

After reading the discussion in #40277 (in particular, the part on not allowing copy/move operators for hls4mlEmulator::ModelLoader), I force-pushed a couple of updates and edited my previous comment accordingly (#46740 (comment)).

@mmusich
Copy link
Contributor

mmusich commented Jan 8, 2025

Did others have something else in mind ? If you think it's useful, I can put this in a PR (incl. missirol/hls4mlEmulatorExtras@69f7d1e, assuming that's an acceptable way to update that external).

@cms-sw/l1-l2 can you please comment? From the HLT side, I think a PR would be useful.

@missirol
Copy link
Contributor

missirol commented Jan 9, 2025

Latest (and hopefully, final) evolution of #46740 (comment).

The latest changes are mainly in the hls4mlEmulatorExtras external, going for a minimal update of hls4mlEmulatorExtras::ModelLoader, rather than introducing a new class.

If there are no objections, I'll open PRs to CMSSW and hls4mlEmulatorExtras by the end of the week. (the CMSSW PR will initially conflict with #47030, then one of the two will have to be rebased eventually)

@aloeliger
Copy link
Contributor

Did others have something else in mind ? If you think it's useful, I can put this in a PR (incl. missirol/hls4mlEmulatorExtras@69f7d1e, assuming that's an acceptable way to update that external).

@cms-sw/l1-l2 can you please comment? From the HLT side, I think a PR would be useful.

Sorry, catching back up after the holidays here. I'm a little confused on the intended purpose of the ModelWrapper. It's fine, and I can approve that PR to the hls4ml extras myself on a moment's notice, but I would have thought the primary change needed here was caching the model itself to the uGT GlobalBoard or whichever top level GT producer contains most of the module members for use. Did I miss plans or PRs from the AXO team to do that? Is this just a convenience?

@mmusich
Copy link
Contributor

mmusich commented Jan 9, 2025

but I would have thought the primary change needed here was caching the model itself to the uGT GlobalBoard or whichever top level GT producer contains most of the module members for use. Did I miss plans or PRs from the AXO team to do that? Is this just a convenience?

Isn't it what Marino proposed as missirol:devel_cmssw46740_v2 == CMSSW_15_0_0_pre1 + missirol@9dc135c + missirol@fe94271. (see #46740 (comment) ) ?

@missirol
Copy link
Contributor

missirol commented Jan 9, 2025

Did others have something else in mind ? If you think it's useful, I can put this in a PR (incl. missirol/hls4mlEmulatorExtras@69f7d1e, assuming that's an acceptable way to update that external).

@cms-sw/l1-l2 can you please comment? From the HLT side, I think a PR would be useful.

Sorry, catching back up after the holidays here. I'm a little confused on the intended purpose of the ModelWrapper. It's fine, and I can approve that PR to the hls4ml extras myself on a moment's notice, but I would have thought the primary change needed here was caching the model itself to the uGT GlobalBoard or whichever top level GT producer contains most of the module members for use. Did I miss plans or PRs from the AXO team to do that? Is this just a convenience?

In short, ModelWrapper is just something I used for convenience while testing (I don't fully understand why one needs to carry around both a Model and ModelLoader in client code, having a single object would be more intuitive to me). In my latest comment (#46740 (comment)), I removed the ModelWrapper and simply updated minimally ModelLoader: that will be enough for this issue.

As for this issue itself, the point is simply that (a) the construction of the GT conditions is moved from produce to beginRun, and (b) the anomaly-detection-model loading is moved from the evaluateCondition to the constructor of the GT condition, so the model loading is done only once per condition as opposed to once per evaluation (I guess, in other words, the model is cached in the GT condition). I don't see how one could do it less often, since a job can in principle span multiple L1T menus (across different runs), and each menu can contain GT conditions using different models. That's as far as I got: if L1T experts can provide a better update before 15_0_X closes, I'm of course in favor of that (over my suggestion). In the meantime, I'll prepare the PRs.

@missirol
Copy link
Contributor

missirol commented Jan 9, 2025

#46740 (comment) is now implemented in #47070 + cms-hls4ml/hls4mlEmulatorExtras#6.

@missirol
Copy link
Contributor

missirol commented Jan 17, 2025

Estimated HLT throughput [1] [2] for the IBs before and after #47070 was merged (so #47070 is not the only change; I suspect #47030 has no impact; I'm not sure about #47047). There is a ~1.7% speedup overall; the baseline number has larger variance and maybe a downward fluctuation, I re-tried a few times but couldn't get it to be more stable today; at least, the speedup is clear in the L1-uGT emulator (6.2 ms vs 2.2 ms per event).

  • CMSSW_15_0_X_2025-01-14-1100: 648.3 +/- 6.5 events/sec (json).
  • CMSSW_15_0_X_2025-01-14-2300: 659.5 +/- 1.8 events/sec (json).

Thanks @slava77 for opening this issue.


[1]

  • Input data: run-383631, LSs 476-479, ~40k events (PU ~64).
  • HLT menu: /dev/CMSSW_14_2_0/GRun/V11 (latest GRun menu).
  • Node: hilton-c2b01-44-01 (same hardware as a 2022/23 HLT node) (CPUs: 2 AMD EPYC 7763 64-Core; GPUs: 2 NVIDIA Tesla T4).
  • 8 jobs, 32 threads and 24 streams per job.
  • NVIDIA MPS enabled.
  • Each of the two measurements is the average of 4 repetitions.

[2]

Cannot find MPS control daemon process
Compute mode is already set to DEFAULT for GPU 00000000:23:00.0.
Compute mode is already set to DEFAULT for GPU 00000000:E2:00.0.
All done.
Set compute mode to EXCLUSIVE_PROCESS for GPU 00000000:23:00.0.
Set compute mode to EXCLUSIVE_PROCESS for GPU 00000000:E2:00.0.
All done.
-----------------------------------------------------------------------------------------------------------------------------
 /fff/user/missirol/output/hltThroughputUtils/out_250117_cmssw47070_0587d2d-baseline-CMSSW_15_0_X_2025-01-14-1100-00-gpu_mps
-----------------------------------------------------------------------------------------------------------------------------
2 CPUs:
  0: AMD EPYC 7763 64-Core Processor (64 cores, 128 threads)
  1: AMD EPYC 7763 64-Core Processor (64 cores, 128 threads)

2 visible NVIDIA CUDA GPUs:
  0: Tesla T4
  1: Tesla T4

No visible AMD ROCm GPUs

Benchmarking only I/O
Warming up

Running 4 times over 40100 events with 8 jobs, each with 32 threads, 24 streams, and 1 GPUs
 12540.0 ±   1.0 ev/s (39800 events, 97.2% overlap),  12540.4 ±   1.0 ev/s (⩾ 38200 events, overlap-only)
 12674.3 ±   1.0 ev/s (39800 events, 98.6% overlap),  12673.9 ±   1.0 ev/s (⩾ 38800 events, overlap-only)
 12559.6 ±   1.3 ev/s (39800 events, 98.7% overlap),  12556.5 ±   1.3 ev/s (⩾ 38900 events, overlap-only)
 12526.8 ±   1.2 ev/s (39800 events, 98.6% overlap),  12526.3 ±   1.3 ev/s (⩾ 38800 events, overlap-only)
 --------------------
 12575.2 ±  67.4 ev/s,  12574.3 ±  67.6 ev/s (⩾ 38200 events, overlap-only)


Benchmarking out_250117_cmssw47070_0587d2d_baseline_dump.py
Warming up

Running 4 times over 40100 events with 8 jobs, each with 32 threads, 24 streams, and 1 GPUs
   654.0 ±   0.0 ev/s (39800 events, 99.0% overlap),    654.0 ±   0.0 ev/s (⩾ 38900 events, overlap-only)
   646.2 ±   0.1 ev/s (39800 events, 97.9% overlap),    645.9 ±   0.0 ev/s (⩾ 35100 events, overlap-only)
   653.1 ±   0.0 ev/s (39800 events, 99.2% overlap),    653.1 ±   0.0 ev/s (⩾ 38600 events, overlap-only)
   640.1 ±   0.1 ev/s (39800 events, 95.9% overlap),    639.3 ±   0.1 ev/s (⩾ 34600 events, overlap-only)
 --------------------
   648.3 ±   6.5 ev/s,    648.1 ±   6.9 ev/s (⩾ 34600 events, overlap-only)

Set compute mode to DEFAULT for GPU 00000000:23:00.0.
Set compute mode to DEFAULT for GPU 00000000:E2:00.0.
All done.
Cannot find MPS control daemon process
Compute mode is already set to DEFAULT for GPU 00000000:23:00.0.
Compute mode is already set to DEFAULT for GPU 00000000:E2:00.0.
All done.
Set compute mode to EXCLUSIVE_PROCESS for GPU 00000000:23:00.0.
Set compute mode to EXCLUSIVE_PROCESS for GPU 00000000:E2:00.0.
All done.
-----------------------------------------------------------------------------------------------------------------------------
 /fff/user/missirol/output/hltThroughputUtils/out_250117_cmssw47070_0587d2d-baseline-CMSSW_15_0_X_2025-01-14-2300-00-gpu_mps
-----------------------------------------------------------------------------------------------------------------------------
2 CPUs:
  0: AMD EPYC 7763 64-Core Processor (64 cores, 128 threads)
  1: AMD EPYC 7763 64-Core Processor (64 cores, 128 threads)

2 visible NVIDIA CUDA GPUs:
  0: Tesla T4
  1: Tesla T4

No visible AMD ROCm GPUs

Benchmarking only I/O
Warming up

Running 4 times over 40100 events with 8 jobs, each with 32 threads, 24 streams, and 1 GPUs
 12396.3 ±   1.2 ev/s (39800 events, 98.9% overlap),  12394.7 ±   1.2 ev/s (⩾ 39000 events, overlap-only)
 12500.3 ±   0.7 ev/s (39800 events, 99.7% overlap),  12499.9 ±   0.7 ev/s (⩾ 39400 events, overlap-only)
 12456.2 ±   1.2 ev/s (39800 events, 99.2% overlap),  12455.0 ±   1.2 ev/s (⩾ 39200 events, overlap-only)
 12607.7 ±   1.3 ev/s (39800 events, 99.0% overlap),  12606.5 ±   1.3 ev/s (⩾ 39000 events, overlap-only)
 --------------------
 12490.2 ±  89.2 ev/s,  12489.0 ±  89.4 ev/s (⩾ 39000 events, overlap-only)


Benchmarking out_250117_cmssw47070_0587d2d_baseline_dump.py
Warming up

Running 4 times over 40100 events with 8 jobs, each with 32 threads, 24 streams, and 1 GPUs
   658.4 ±   0.0 ev/s (39800 events, 99.1% overlap),    658.4 ±   0.0 ev/s (⩾ 38600 events, overlap-only)
   658.4 ±   0.0 ev/s (39800 events, 99.2% overlap),    658.4 ±   0.0 ev/s (⩾ 38400 events, overlap-only)
   662.2 ±   0.0 ev/s (39800 events, 99.0% overlap),    662.2 ±   0.0 ev/s (⩾ 38200 events, overlap-only)
   658.9 ±   0.0 ev/s (39800 events, 98.9% overlap),    658.9 ±   0.0 ev/s (⩾ 38400 events, overlap-only)
 --------------------
   659.5 ±   1.8 ev/s,    659.5 ±   1.8 ev/s (⩾ 38200 events, overlap-only)

Set compute mode to DEFAULT for GPU 00000000:23:00.0.
Set compute mode to DEFAULT for GPU 00000000:E2:00.0.
All done.

@mmusich
Copy link
Contributor

mmusich commented Jan 19, 2025

+hlt

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants