-
Notifications
You must be signed in to change notification settings - Fork 375
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prevent telemetry deadlock with no-op #3910
Conversation
4d5b459
to
0929afc
Compare
0929afc
to
503bb5a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a very nice idea, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few notes!
# IMPORTANT: Invoking this method during the lifecycle of component initialization will | ||
# cause a non-recoverable deadlock | ||
# be no-op instead. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While causing a deadlock is really bad, I think we should plan to fix the underlying problem. E.g. I think the core should provide to components the service of making telemetry available very early on; it seems reasonable for components to want to report data during initialization.
lib/datadog/core/telemetry/logger.rb
Outdated
# `allow_initialization: false` to prevent deadlock from components lifecycle | ||
components = Datadog.send(:components, allow_initialization: false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason for this deadlock is quite subtle; I would suggest writing a bit more here to explain what's going on and why we're doing this (so in the future other folks don't need to "reverse engineer" what happened here)
lib/datadog/core/telemetry/logger.rb
Outdated
Datadog.logger.error( | ||
'Fail to send telemetry log before components initialization or within components lifecycle' | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be seen by customers -- do we want to perhaps lower this to warn or even lower? It's not like they'll be able to do anything about this issue, since this is a bug on our side
BenchmarksBenchmark execution time: 2024-09-13 13:29:56 Comparing candidate commit 19c88eb in PR branch Found 0 performance improvements and 0 performance regressions! Performance is the same for 23 metrics, 2 unstable metrics. |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #3910 +/- ##
==========================================
- Coverage 97.85% 97.85% -0.01%
==========================================
Files 1282 1282
Lines 76749 76765 +16
Branches 3759 3763 +4
==========================================
+ Hits 75106 75119 +13
- Misses 1643 1646 +3 ☔ View full report in Codecov by Sentry. |
lib/datadog/core/telemetry/logger.rb
Outdated
# `allow_initialization: false` would avoid referencing the components via `safely_synchronize` (mutex) | ||
# which could cause deadlock during components initialization. | ||
components = Datadog.send(:components, allow_initialization: false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still suspect the current explanation requires quite a bit of context to understand. Perhaps something like
"Component initialization uses a mutex to avoid having concurrent initialization. Trying to access the telemetry component during initialization (specifically: from the thread that's actually doing the initialization) would cause a deadlock, since accessing the components would try to recursively lock the mutex.
To work around this, we use allow_initialization: false to avoid triggering this issue.
The downside is: this leaves us unable to report telemetry during component initialization."
Something like this?
What does this PR do?
It is still far too dangerous for
Telemetry::Logger
to hit a deadlock. This is the risk should be mitigated especially for telemetry which is considered to be a kind of best effort utility.