Prevent telemetry deadlock with no-op #3910

Merged
TonyCTHsu merged 5 commits into master from tonycthsu/avoid-deadlock on Sep 13, 2024

Conversation

TonyCTHsu

What does this PR do?

It is still far too dangerous for Telemetry::Logger to hit a deadlock. This risk should be mitigated, especially for telemetry, which is considered a best-effort utility.

github-actions bot added the "core" label (Involves Datadog core libraries) on Sep 13, 2024
TonyCTHsu added the "dev/internal" label (Other internal work that does not need to be included in the changelog) on Sep 13, 2024
TonyCTHsu added this to the 2.4.0 milestone on Sep 13, 2024
TonyCTHsu marked this pull request as ready for review on September 13, 2024 10:29
TonyCTHsu requested a review from a team as a code owner on September 13, 2024 10:29

anmarchenko left a comment

a very nice idea, thanks!


ivoanjo left a comment

Left a few notes!

lib/datadog/core/telemetry/logger.rb
Comment on lines 10 to +11
# IMPORTANT: Invoking this method during the lifecycle of component initialization will
# cause a non-recoverable deadlock
# be no-op instead.

While causing a deadlock is really bad, I think we should plan to fix the underlying problem. E.g., I think the core should make telemetry available to components very early on; it seems reasonable for components to want to report data during initialization.

Comment on lines 28 to 29
# `allow_initialization: false` to prevent deadlock from components lifecycle
components = Datadog.send(:components, allow_initialization: false)

The reason for this deadlock is quite subtle; I would suggest writing a bit more here to explain what's going on and why we're doing this (so that in the future other folks don't need to "reverse engineer" what happened here).

Comment on lines 34 to 36
Datadog.logger.error(
'Fail to send telemetry log before components initialization or within components lifecycle'
)

This will be seen by customers -- do we perhaps want to downgrade this to warn, or even lower? It's not like they'll be able to do anything about this issue, since this is a bug on our side.
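
One way to read this suggestion, sketched under assumptions: when telemetry is not reachable, the call becomes a no-op and leaves only a debug-level trace, so nothing shows up at a customer's usual log threshold. The helper name `report_to_telemetry`, the lambda standing in for the telemetry component, and the WARN threshold are all hypothetical and not taken from the library.

```ruby
require 'logger'
require 'stringio'

# Hypothetical helper, not part of the library: forward a message to telemetry
# when it is available, otherwise drop it and leave only a debug-level trace.
def report_to_telemetry(telemetry, message, logger:)
  if telemetry
    telemetry.call(message)
    true
  else
    logger.debug { "Telemetry unavailable, dropping log: #{message}" }
    false
  end
end

log_output = StringIO.new
# The WARN threshold stands in for a typical production logger setup (assumption).
customer_logger = Logger.new(log_output, level: Logger::WARN)

# During initialization telemetry is not available yet, so the call is a quiet no-op.
dropped = report_to_telemetry(nil, 'boot message', logger: customer_logger)

# Once telemetry exists (modeled here as a lambda), the message goes through.
received = []
delivered = report_to_telemetry(->(msg) { received << msg }, 'later message', logger: customer_logger)
```

With the logger at WARN, the dropped message produces no output at all, which is the effect the comment above is asking about.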

pr-commenter bot commented Sep 13, 2024

Benchmarks

Benchmark execution time: 2024-09-13 13:29:56

Comparing candidate commit 19c88eb in PR branch tonycthsu/avoid-deadlock with baseline commit c9994df in branch master.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 23 metrics, 2 unstable metrics.

@codecov-commenter

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.85%. Comparing base (7dbcc40) to head (19c88eb).
Report is 2 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3910      +/-   ##
==========================================
- Coverage   97.85%   97.85%   -0.01%     
==========================================
  Files        1282     1282              
  Lines       76749    76765      +16     
  Branches     3759     3763       +4     
==========================================
+ Hits        75106    75119      +13     
- Misses       1643     1646       +3     

Comment on lines 28 to 30
# `allow_initialization: false` would avoid referencing the components via `safely_synchronize` (mutex)
# which could cause deadlock during components initialization.
components = Datadog.send(:components, allow_initialization: false)

I still suspect the current explanation requires quite a bit of context to understand. Perhaps something like:

"Component initialization uses a mutex to avoid having concurrent initialization. Trying to access the telemetry component during initialization (specifically: from the thread that's actually doing the initialization) would cause a deadlock, since accessing the components would try to recursively lock the mutex.

To work around this, we use allow_initialization: false to avoid triggering this issue.

The downside is: this leaves us unable to report telemetry during component initialization."

Something like this?
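
The recursive locking described in the suggested comment can be reproduced in isolation. The sketch below is not the library's code: `Registry` and `on_build` are hypothetical names, and only the `allow_initialization:` keyword is borrowed from the diff. It assumes initialization is guarded by a plain Ruby `Mutex`, which is not reentrant.

```ruby
# Hypothetical registry that mimics lazy, mutex-guarded component initialization.
# Registry and on_build are illustrative names, not taken from the library.
class Registry
  # Callback that simulates a component logging while it is being built.
  attr_writer :on_build

  def initialize
    @mutex = Mutex.new
    @components = nil
  end

  def components(allow_initialization: true)
    current = @components
    return current if current || !allow_initialization

    # Ruby's Mutex is not reentrant: locking it again from the owning thread
    # raises ThreadError instead of blocking.
    @mutex.synchronize { @components ||= build_components }
  end

  private

  def build_components
    @on_build&.call(self)
    { telemetry: :ready }
  end
end

# Re-entering with initialization allowed hits the recursive lock.
risky = Registry.new
risky.on_build = ->(registry) { registry.components }
recursive_lock_error =
  begin
    risky.components
    nil
  rescue ThreadError => e
    e
  end

# Re-entering with allow_initialization: false skips the lock and sees nothing yet.
safe = Registry.new
seen_during_init = :unset
safe.on_build = ->(registry) { seen_during_init = registry.components(allow_initialization: false) }
result = safe.components
```

In this sketch the recursive call surfaces as a `ThreadError` rather than a hang, because Ruby's `Mutex` detects relocking by the thread that already owns it. The second registry shows the trade-off named in the suggested comment: initialization completes, but the caller inside it gets `nil` and has nothing to report to.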

TonyCTHsu merged commit 3860fa7 into master on Sep 13, 2024
187 of 190 checks passed
TonyCTHsu deleted the tonycthsu/avoid-deadlock branch on September 13, 2024 14:04