potential speedup in HLT: AXOL1TLCondition::evaluateCondition consider caching the model? #46740
A new Issue was created by @slava77. @Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
@aloeliger @artlbv FYI |
I can also take a look at this when I get some time, frankly I've been meaning to for a while. |
in a more "realistic" setup it's more like sub-percent (details from #45631 (comment)) |
assign l1 |
type performance-improvements |
New categories assigned: l1 @aloeliger,@epalencia you have been requested to review this Pull request/Issue and eventually sign? Thanks |
My 2.5% comes from a more recent test (but on GRun menu) by @bdanzi |
I'd rather trust the manual measurement more than what comes from the timing server, tbh. |
essentially all of the cost is in perhaps just having the .so file loaded in the constructor or begin job is enough |
The "problem" is that the model is evaluated (and loaded) for every threshold and BX separately, even though it is the same model. If caching is possible, that would of course help; in principle the model(s) are known at the beginning of the job, as they are fixed in the L1 menu. IMO it would be good to have a common approach to loading HLS4ML models within L1/CMSSW, as e.g. here it seems to be done differently: |
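A minimal sketch of the caching idea in the comment above, written in Python for brevity rather than in the C++ of the actual emulator. `FakeModelLoader`, `evaluate_uncached` and `get_cached_model` are hypothetical names used only for illustration; they are not CMSSW or hls4mlEmulatorExtras APIs. The sketch only shows that loading the model once per job, instead of once per threshold and BX, gives the same decisions with far fewer loads.

```python
from functools import lru_cache


class FakeModelLoader:
    """Hypothetical stand-in for an expensive loader (e.g. one that opens a .so file)."""

    loads = 0  # counts how many times a model has been loaded

    def __init__(self, name):
        self.name = name

    def load_model(self):
        FakeModelLoader.loads += 1
        # Toy "model": sums its inputs; a real model would run inference.
        return lambda inputs: sum(inputs)


def evaluate_uncached(name, inputs, threshold):
    """Load the model on every call (the per-threshold, per-BX pattern)."""
    model = FakeModelLoader(name).load_model()
    return model(inputs) > threshold


@lru_cache(maxsize=None)
def get_cached_model(name):
    """Load each named model at most once and reuse it afterwards."""
    return FakeModelLoader(name).load_model()


def evaluate_cached(name, inputs, threshold):
    return get_cached_model(name)(inputs) > threshold


thresholds = [1, 5, 10]
bxs = [-2, -1, 0, 1, 2]
inputs_per_bx = {bx: [bx + 2, 3] for bx in bxs}  # toy inputs, one set per BX

FakeModelLoader.loads = 0
uncached = [evaluate_uncached("toy", inputs_per_bx[bx], t) for t in thresholds for bx in bxs]
uncached_loads = FakeModelLoader.loads

FakeModelLoader.loads = 0
cached = [evaluate_cached("toy", inputs_per_bx[bx], t) for t in thresholds for bx in bxs]
cached_loads = FakeModelLoader.loads

print(uncached_loads, cached_loads, uncached == cached)
```

In this toy setup the uncached path loads the model once per (threshold, BX) pair, while the cached path loads it a single time. Whether the real model object can be shared this way depends on its thread-safety and ownership rules, which this sketch does not address.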
Given that the interface of This is how cmssw/L1Trigger/Phase2L1ParticleFlow/interface/JetId.h Lines 50 to 53 in 92333e3
is used in cmssw/L1Trigger/Phase2L1ParticleFlow/plugins/L1BJetProducer.cc Lines 51 to 52 in 92333e3
(now if |
assign hlt
|
New categories assigned: hlt @Martin-Grunewald,@mmusich you have been requested to review this Pull request/Issue and eventually sign? Thanks |
Hi @slava77, for these tests, is the --timing flag being used in hltGetConfiguration? In tests we ran last year, we saw a big slowdown without the --timing flag; e.g. I am attaching screenshots from a presentation here. The "Dec" results are when the |
As far as I understand, these measurements were derived using the timing server, so the flag should have been used by default. |
@mmusich We were using the timing server for the tests too, and at least around this time last year it was not enabled by default. Completely agreed that it is an opportunity to make the code more efficient. As after running the |
I gave it a try in https://github.com/missirol/cmssw/commits/devel_cmssw46740_v1, which includes two commits on top of
I tested this update with the script in [1] (running the L1-uGT emulator on roughly 50k events of 2024 pp data), and I got the outputs in [2] from the
Did others have something else in mind? If you think it's useful, I can put this in a PR (incl. missirol/hls4mlEmulatorExtras@69f7d1e, assuming that's an acceptable way to update that external).
[1]
import FWCore.ParameterSet.Config as cms
import FWCore.ParameterSet.VarParsing as vpo
opts = vpo.VarParsing('standard')
opts.setDefault('maxEvents', 1000)
opts.parseArguments()
process = cms.Process('TEST')
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
process.options.wantSummary = True
process.maxEvents.input = opts.maxEvents
# Global Tag
from Configuration.AlCa.GlobalTag import GlobalTag
process.load('Configuration.StandardSequences.FrontierConditions_GlobalTag_cff')
process.GlobalTag = GlobalTag(process.GlobalTag, '141X_dataRun3_HLT_v2', '')
# Input source
process.source = cms.Source('PoolSource',
    fileNames = cms.untracked.vstring(
        'root://eoscms.cern.ch//eos/cms/store/data/Run2024I/EphemeralHLTPhysics7/RAW/v1/000/386/593/00000/00aa2e37-b442-4fa6-b087-73c54699784d.root',
        'root://eoscms.cern.ch//eos/cms/store/data/Run2024I/EphemeralHLTPhysics7/RAW/v1/000/386/593/00000/018e0570-6f5f-48db-ab22-3c156607f248.root',
        'root://eoscms.cern.ch//eos/cms/store/data/Run2024I/EphemeralHLTPhysics7/RAW/v1/000/386/593/00000/036cb164-49e7-490f-8a3b-c1f37f713deb.root',
        'root://eoscms.cern.ch//eos/cms/store/data/Run2024I/EphemeralHLTPhysics7/RAW/v1/000/386/593/00000/074e8d3f-d60f-46a4-8fbe-e16e86d2b693.root',
        'root://eoscms.cern.ch//eos/cms/store/data/Run2024I/EphemeralHLTPhysics7/RAW/v1/000/386/593/00000/078f24ee-6776-45b1-a437-5b58de809fb6.root',
        'root://eoscms.cern.ch//eos/cms/store/data/Run2024I/EphemeralHLTPhysics7/RAW/v1/000/386/807/00000/136403f7-a7fe-40c2-8af6-280f91f82976.root',
        'root://eoscms.cern.ch//eos/cms/store/data/Run2024I/EphemeralHLTPhysics7/RAW/v1/000/386/807/00000/247d49c3-5efe-4de1-8046-303e510b6a00.root',
        'root://eoscms.cern.ch//eos/cms/store/data/Run2024I/EphemeralHLTPhysics7/RAW/v1/000/386/807/00000/357f3289-56d5-413c-958f-ecc5844bd4ed.root',
        'root://eoscms.cern.ch//eos/cms/store/data/Run2024I/EphemeralHLTPhysics7/RAW/v1/000/386/807/00000/42371903-8e35-487a-a646-2ba330cf8cf3.root',
        'root://eoscms.cern.ch//eos/cms/store/data/Run2024I/EphemeralHLTPhysics7/RAW/v1/000/386/807/00000/46de66dd-f64c-4ee7-8553-1ebb2e08cc68.root',
    )
)
# EventSetup modules
process.GlobalParametersRcdSource = cms.ESSource('EmptyESSource',
    recordName = cms.string('L1TGlobalParametersRcd'),
    iovIsRunNotTime = cms.bool(True),
    firstValid = cms.vuint32(1)
)
process.GlobalParameters = cms.ESProducer('StableParametersTrivialProducer',
    # trigger decision
    NumberPhysTriggers = cms.uint32(512), # number of physics trigger algorithms
    # trigger objects
    NumberL1Muon = cms.uint32(8), # muons
    NumberL1EGamma = cms.uint32(12), # e/gamma and isolated e/gamma objects
    NumberL1Jet = cms.uint32(12), # jets
    NumberL1Tau = cms.uint32(12), # taus
    # hardware
    NumberChips = cms.uint32(1), # number of maximum chips defined in the xml file
    PinsOnChip = cms.uint32(512), # number of pins on the GTL condition chips
    # correspondence 'condition chip - GTL algorithm word' in the hardware
    # e.g.: chip 2: 0 - 95; chip 1: 96 - 128 (191)
    OrderOfChip = cms.vint32(1),
)
# EventData modules
process.simGtExtFakeStage2Digis = cms.EDProducer('L1TExtCondProducer',
    bxFirst = cms.int32(-2),
    bxLast = cms.int32(2),
    setBptxAND = cms.bool(True),
    setBptxMinus = cms.bool(True),
    setBptxOR = cms.bool(True),
    setBptxPlus = cms.bool(True),
    tcdsRecordLabel = cms.InputTag('')
)
process.simGtStage2Digis = cms.EDProducer('L1TGlobalProducer',
    AlgoBlkInputTag = cms.InputTag(''),
    AlgorithmTriggersUnmasked = cms.bool(True),
    AlgorithmTriggersUnprescaled = cms.bool(True),
    EGammaInputTag = cms.InputTag('simCaloStage2Digis'),
    EtSumInputTag = cms.InputTag('simCaloStage2Digis'),
    ExtInputTag = cms.InputTag('simGtExtFakeStage2Digis'),
    JetInputTag = cms.InputTag('simCaloStage2Digis'),
    MuonInputTag = cms.InputTag('simGmtStage2Digis'),
    MuonShowerInputTag = cms.InputTag('simGmtShowerDigis'),
    TauInputTag = cms.InputTag('simCaloStage2Digis'),
    useMuonShowers = cms.bool(True),
    RequireMenuToMatchAlgoBlkInput = cms.bool(False),
)
# Sequence definition
process.l1tSequence = cms.Sequence( process.simGtExtFakeStage2Digis + process.simGtStage2Digis )
# Path definition
process.l1tPath = cms.Path( process.l1tSequence )
## Analyser of L1T-menu results
#process.l1tGlobalSummary = cms.EDAnalyzer( 'L1TGlobalSummary',
# AlgInputTag = cms.InputTag( 'simGtStage2Digis' ),
# ExtInputTag = cms.InputTag( 'simGtStage2Digis' ),
# MinBx = cms.int32( 0 ),
# MaxBx = cms.int32( 0 ),
# DumpTrigResults = cms.bool( False ),
# DumpRecord = cms.bool( False ),
# DumpTrigSummary = cms.bool( True ),
# ReadPrescalesFromFile = cms.bool( False ),
# psFileName = cms.string( '' ),
# psColumn = cms.int32( 0 )
#)
#
## EndPath definition
#process.l1tEndPath = cms.EndPath( process.l1tGlobalSummary )
# MessageLogger
process.load('FWCore.MessageService.MessageLogger_cfi')
process.MessageLogger.cerr.FwkReport.reportEvery = 1000
process.MessageLogger.L1TGlobalSummary = cms.untracked.PSet()
process.MessageLogger.FastReport = cms.untracked.PSet()
# FastTimerService
process.FastTimerService = cms.Service( "FastTimerService",
    printEventSummary = cms.untracked.bool( False ),
    printRunSummary = cms.untracked.bool( False ),
    printJobSummary = cms.untracked.bool( True ),
    writeJSONSummary = cms.untracked.bool( False ),
    jsonFileName = cms.untracked.string( "resources.json" ),
    enableDQM = cms.untracked.bool( False ),
    enableDQMbyModule = cms.untracked.bool( False ),
    enableDQMbyPath = cms.untracked.bool( False ),
    enableDQMbyLumiSection = cms.untracked.bool( False ),
    enableDQMbyProcesses = cms.untracked.bool( False ),
    enableDQMTransitions = cms.untracked.bool( False ),
    dqmTimeRange = cms.untracked.double( 2000.0 ),
    dqmTimeResolution = cms.untracked.double( 5.0 ),
    dqmMemoryRange = cms.untracked.double( 1000000.0 ),
    dqmMemoryResolution = cms.untracked.double( 5000.0 ),
    dqmPathTimeRange = cms.untracked.double( 100.0 ),
    dqmPathTimeResolution = cms.untracked.double( 0.5 ),
    dqmPathMemoryRange = cms.untracked.double( 1000000.0 ),
    dqmPathMemoryResolution = cms.untracked.double( 5000.0 ),
    dqmModuleTimeRange = cms.untracked.double( 40.0 ),
    dqmModuleTimeResolution = cms.untracked.double( 0.2 ),
    dqmModuleMemoryRange = cms.untracked.double( 100000.0 ),
    dqmModuleMemoryResolution = cms.untracked.double( 500.0 ),
    dqmLumiSectionsRange = cms.untracked.uint32( 2500 ),
    dqmPath = cms.untracked.string( "HLT/TimerService" ),
)
[2] Tested on
|
After reading the discussion in #40277 (in particular, the part on not allowing copy/move operators for |
@cms-sw/l1-l2 can you please comment? From the HLT side, I think a PR would be useful. |
Latest (and hopefully, final) evolution of #46740 (comment).
The latest changes are mainly in the If there are no objections, I'll open PRs to CMSSW and |
Sorry, catching back up after the holidays here. I'm a little confused on the intended purpose of the |
Isn't that what Marino proposed as missirol:devel_cmssw46740_v2 == CMSSW_15_0_0_pre1 + missirol@9dc135c + missirol@fe94271 (see #46740 (comment))? |
In short, As for this issue itself, the point is simply that (a) the construction of the GT conditions is moved from |
#46740 (comment) is now implemented in #47070 + cms-hls4ml/hls4mlEmulatorExtras#6. |
Estimated HLT throughput [1] [2] for the IBs before and after #47070 was merged (so #47070 is not the only change; I suspect #47030 has no impact; I'm not sure about #47047). There is a ~1.7% speedup overall. The baseline number has a larger variance and maybe a downward fluctuation; I re-tried a few times but couldn't get it to be more stable today. At least the speedup is clear in the L1-uGT emulator (6.2 ms vs 2.2 ms per event).
Thanks @slava77 for opening this issue. [1]
[2]
|
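As a quick cross-check of the L1-uGT emulator numbers quoted above, the following sketch computes the time saved per event and the relative reduction. It assumes that 6.2 ms is the value before #47070 and 2.2 ms the value after, as the ordering in the comment suggests; the variable names are illustrative.

```python
# Per-event L1-uGT emulator time quoted in the comment above (ms).
time_before_ms = 6.2  # assumed: before the change
time_after_ms = 2.2   # assumed: after the change

saved_ms = time_before_ms - time_after_ms
reduction = saved_ms / time_before_ms          # fraction of emulator time removed
speedup_factor = time_before_ms / time_after_ms

print(f"saved per event: {saved_ms:.1f} ms")
print(f"relative reduction: {reduction:.1%}")
print(f"speedup factor: {speedup_factor:.2f}x")
```

Under these assumptions, the emulator time drops by roughly two thirds, which is in the same range as the share that the original profile in this issue attributes to model loading and destruction (84% of `evaluateCondition`, itself 78% of `L1TGlobalProducer::produce`). The two measurements come from different setups, so this is only a rough consistency check.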
+hlt
|
Looking at a profile of HLT in 14_1_X with callgrind, on MC running only MC_ReducedIterativeTracking_v22, I see that 78% of L1TGlobalProducer::produce is spent in l1t::AXOL1TLCondition::evaluateCondition
https://github.com/cms-sw/cmssw/blob/CMSSW_14_1_0_pre5/L1Trigger/L1TGlobal/src/AXOL1TLCondition.cc#L100
In my test, L1TGlobalProducer::produce takes 9.7% of the time; in the full menu it's apparently around 0.9% (updated from 2.5% to 0.9%, see notes below).
Of all the time spent in l1t::AXOL1TLCondition::evaluateCondition:
hls4mlEmulator::ModelLoader::load_model() is 54%
hls4mlEmulator::ModelLoader::~ModelLoader() is 30%
GTADModel_emulator_v4::predict() is 15%
IIUC, the load and destruction of the model happens 10 times per event; I don't see any dependence on the current event variables.
Some kind of caching may be useful to get HLT to run a bit faster (seems like 0.6% or so; updated from 1.5%).
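The 0.6% estimate can be reproduced with a short calculation. This is a sketch that assumes the profile fractions quoted in this post simply multiply (load plus destruction as a share of `evaluateCondition`, times the share of `evaluateCondition` in `produce`, times the share of `produce` in the full menu); the variable names are illustrative.

```python
# Fractions quoted in the profile above.
frac_load = 0.54                # ModelLoader::load_model() share of evaluateCondition
frac_destroy = 0.30             # ModelLoader::~ModelLoader() share of evaluateCondition
frac_eval_in_produce = 0.78     # evaluateCondition share of L1TGlobalProducer::produce
frac_produce_full_menu = 0.009  # produce share of the full menu (updated 0.9% figure)

# Share of produce spent loading and destroying the model.
cacheable = (frac_load + frac_destroy) * frac_eval_in_produce
# Upper bound on the saving in the full menu if load/destroy were eliminated.
saving_full_menu = cacheable * frac_produce_full_menu

print(f"load+destroy share of produce: {cacheable:.1%}")
print(f"potential saving in the full menu: {saving_full_menu:.2%}")
```

This treats the load and destruction cost as fully removable, so it is an upper bound on what caching alone could recover.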