Gracefully handle TPMI initialization failure (with missing MCFG tables) #749
In environments without access to the MCFG tables, such as an unprivileged Docker container without the required directories mapped or a VM, the MCFG tables are missing. In that case I would expect pcm to collect a limited set of metrics rather than fail.
Instead, after SRF support (TPMI) was added with this commit, we get this:
TPMI is initialized in initUncoreObjects with the following code:
if (TPMIHandle::getNumInstances() == (size_t)num_sockets)
which implicitly calls:
PFSInstances::get()
to get the singleton, then
processDVSEC()
->forAllIntelDevices()
to do discovery, and then
getMCFGRecords()
->PciHandleMM::getMCFGRecords()
->readMCfg()
which throws an "anonymous" exception when the MCFG files are not available (the files cannot be opened):
The exception above is not handled anywhere and propagates up to the main routine, resulting in a fatal error like this:
output is:
and pcm-sensor-server stops.
The last message in the output, "terminate called without an active exception...", is not very informative and is misleading, because missing access to the MCFG tables is not a blocker or critical. As previously during Uncore discovery (see "WARNING: enumeration of devices in UncorePMUDiscovery failed" above), it should just cause some metrics to be unavailable.
In other words, I would expect pcm/pcm-sensor-server not to fail, with only the TPMI metrics missing. I assume we need to catch the exception one level up, in initUncoreObjects, as proposed in this pull request. Additionally, this pull request replaces the "anonymous" std::exception() with runtime_error() so that we get a more detailed warning like this:
When running this PR
output is more friendly and doesn't block collection of other metrics:
Generally, IMO it would be much more helpful for future debugging if we replaced "plain" std::exception with std::runtime_error. Then, when we run with PCM_NO_MAIN_EXCEPTION_HANDLER=1 and have forgotten to handle an exception, we can easily identify its source (instead of debugging every place that throws std::exception).