Commit 03b420d
## Summary of changes
Avoid having the callback from a `Timer` reconfigure itself
## Reason for change
We have seem [some instances of
StackOverflowExceptions](https://app.datadoghq.com/error-tracking?query=source%3Adotnet%20service%3Ainstrumentation-telemetry-data%20%40tags.crash_datadog%3Atrue%20%40library_version.major%3A3%20-%40tags.crash_profiler%3Atrue%20-%40tags.crash_appsec%3Atrue%20-%40tags.crash_runtime_metrics%3Atrue&et-side=data&link_source=monitor_notif&monitor_id=236533719&monitor_sub_type=.new%28%29&source=all&sp=%5B%7B%22p%22%3A%7B%22issueId%22%3A%22664a7966-c3a5-11f0-b289-da7ad0900002%22%7D%2C%22i%22%3A%22error-tracking-issue%22%7D%5D&from_ts=1763291319000&to_ts=1763377719000&live=false)
that point the finger at the `RuntimeMetricsWriter` and the way it
handles `PushEvents`:
```
0x7FF9817A3F64 Datadog.Trace.dll!Datadog.Trace.RuntimeMetrics.RuntimeMetricsWriter.PushEvents
0x7FF980CC7FD7 Datadog.Trace.dll!Datadog.Trace.RuntimeMetrics.RuntimeMetricsWriter.GetCurrentProcessMetrics
0x7FF980CC81DC Datadog.Trace.dll!Datadog.Trace.RuntimeMetrics.ProcessSnapshotRuntimeInformation.GetCurrentProcessMetrics
0x7FF980CC7ECD Datadog.Trace.dll!Datadog.Trace.RuntimeMetrics.RuntimeMetricsWriter.GetCurrentProcessMetrics
0x7FF9817A392C Datadog.Trace.dll!Datadog.Trace.RuntimeMetrics.RuntimeMetricsWriter.PushEvents
0x7FF9DD2110A2 mscorlib.dll!System.Threading.ExecutionContext.RunInternal
0x7FF9DD210F25 mscorlib.dll!System.Threading.ExecutionContext.Run
0x7FF9DD29AC8A mscorlib.dll!System.Threading.TimerQueueTimer.CallCallback
0x7FF9DD29AA7B mscorlib.dll!System.Threading.TimerQueueTimer.Fire
0x7FF9DD2283F8 mscorlib.dll!System.Threading.TimerQueue.FireNextTimers
```
Today, the `RuntimeMetricsWriter` creates a timer that doesn't repeat,
and fires `PushEvents()` after the delay. Inside the `PushEvents()`
method, after grabbing the metrics, it calculates the required delay
(accounting for drift) and reconfigures the timer. This ensures we don't
have concurrent collections, and we avoid drift as much as possible.
However, as best as we can tell, if the calculated delay required is
very small, it looks like this can handle the callback on the same
thread, and [we could end up in a recursive
situation](https://learn.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-queueuserapc):
> When a user-mode APC is queued, the thread is not directed to call the
APC function unless it is in an alertable state. After the thread is in
an alertable state, the thread handles all pending APCs in first in,
first out (FIFO) order, and the wait operation returns
WAIT_IO_COMPLETION. A thread enters an alertable state by using SleepEx
function, SignalObjectAndWait function, WaitForSingleObjectEx function,
WaitForMultipleObjectsEx function, or MsgWaitForMultipleObjectsEx
function.
>
> If an application queues an APC before the thread begins running, the
thread begins by calling the APC function. After the thread calls an APC
function, it calls the APC functions for all APCs in its APC queue.
>
> It is possible to sleep or wait for an object within the APC. **If you
perform an alertable wait inside an APC, it will recursively dispatch
the APCs. This can cause a stack overflow**.
## Implementation details
To avoid the possibility of recursion (and all the weirdness that comes
with timers), we switch away from the `Timer` approach, and instead use
a long-running task (i.e. dedicated thread) and just `Thread.Sleep` for
the delay. This is suboptimal because:
- We can't interrupt the `Thread.Sleep()`, so we have to sleep for small
periods, check if we're disposed and wake up. This will likely cause a
small amount of drift (the time to wake up and execute the loop logic).
We could account for this by setting a "deadline" for `DateTime.UtcNow`
instead of the "duration requirement". Both have pros and cons.
- It uses another dedicated thread (not from the thread pool)
We _could_ switch to making things async instead, and use the
`ThreadPool` but it's not clear whether all the kernel calls are a bit
time consuming for a threadpool thread 🤷♂️
Finally, we've only seen this issue in .NET Framework, so to try to
minimize the blast radius, this PR limits the change to .NET Framework.
## Test coverage
Should be covered by existing, as it's an internal refactoring
---------
Co-authored-by: NachoEchevarria <53266532+NachoEchevarria@users.noreply.github.com>
1 parent 309a8b7 commit 03b420d
File tree
1 file changed
+51
-4
lines changed- tracer/src/Datadog.Trace/RuntimeMetrics
1 file changed
+51
-4
lines changedLines changed: 51 additions & 4 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
9 | 9 | | |
10 | 10 | | |
11 | 11 | | |
| 12 | + | |
12 | 13 | | |
13 | 14 | | |
14 | 15 | | |
| |||
40 | 41 | | |
41 | 42 | | |
42 | 43 | | |
| 44 | + | |
| 45 | + | |
| 46 | + | |
43 | 47 | | |
44 | | - | |
| 48 | + | |
45 | 49 | | |
46 | 50 | | |
47 | 51 | | |
| |||
122 | 126 | | |
123 | 127 | | |
124 | 128 | | |
| 129 | + | |
| 130 | + | |
| 131 | + | |
| 132 | + | |
| 133 | + | |
| 134 | + | |
125 | 135 | | |
| 136 | + | |
126 | 137 | | |
127 | 138 | | |
128 | 139 | | |
| |||
139 | 150 | | |
140 | 151 | | |
141 | 152 | | |
| 153 | + | |
| 154 | + | |
| 155 | + | |
| 156 | + | |
| 157 | + | |
| 158 | + | |
142 | 159 | | |
143 | 160 | | |
144 | 161 | | |
| |||
150 | 167 | | |
151 | 168 | | |
152 | 169 | | |
153 | | - | |
| 170 | + | |
154 | 171 | | |
155 | 172 | | |
156 | 173 | | |
| |||
160 | 177 | | |
161 | 178 | | |
162 | 179 | | |
163 | | - | |
| 180 | + | |
| 181 | + | |
| 182 | + | |
| 183 | + | |
| 184 | + | |
| 185 | + | |
| 186 | + | |
| 187 | + | |
| 188 | + | |
| 189 | + | |
164 | 190 | | |
165 | 191 | | |
166 | 192 | | |
167 | 193 | | |
168 | | - | |
| 194 | + | |
169 | 195 | | |
170 | 196 | | |
171 | 197 | | |
| |||
259 | 285 | | |
260 | 286 | | |
261 | 287 | | |
| 288 | + | |
| 289 | + | |
| 290 | + | |
| 291 | + | |
| 292 | + | |
| 293 | + | |
| 294 | + | |
| 295 | + | |
| 296 | + | |
| 297 | + | |
| 298 | + | |
| 299 | + | |
| 300 | + | |
| 301 | + | |
| 302 | + | |
| 303 | + | |
| 304 | + | |
| 305 | + | |
262 | 306 | | |
263 | 307 | | |
264 | 308 | | |
| |||
273 | 317 | | |
274 | 318 | | |
275 | 319 | | |
| 320 | + | |
276 | 321 | | |
| 322 | + | |
| 323 | + | |
277 | 324 | | |
278 | 325 | | |
279 | 326 | | |
| |||
0 commit comments