Skip to content

Commit 03b420d

Browse files
Restructure RuntimeMetricsWriter for .NET Framework to fix StackOverflowException (#7830)
## Summary of changes Avoid having the callback from a `Timer` reconfigure itself ## Reason for change We have seem [some instances of StackOverflowExceptions](https://app.datadoghq.com/error-tracking?query=source%3Adotnet%20service%3Ainstrumentation-telemetry-data%20%40tags.crash_datadog%3Atrue%20%40library_version.major%3A3%20-%40tags.crash_profiler%3Atrue%20-%40tags.crash_appsec%3Atrue%20-%40tags.crash_runtime_metrics%3Atrue&et-side=data&link_source=monitor_notif&monitor_id=236533719&monitor_sub_type=.new%28%29&source=all&sp=%5B%7B%22p%22%3A%7B%22issueId%22%3A%22664a7966-c3a5-11f0-b289-da7ad0900002%22%7D%2C%22i%22%3A%22error-tracking-issue%22%7D%5D&from_ts=1763291319000&to_ts=1763377719000&live=false) that point the finger at the `RuntimeMetricsWriter` and the way it handles `PushEvents`: ``` 0x7FF9817A3F64 Datadog.Trace.dll!Datadog.Trace.RuntimeMetrics.RuntimeMetricsWriter.PushEvents 0x7FF980CC7FD7 Datadog.Trace.dll!Datadog.Trace.RuntimeMetrics.RuntimeMetricsWriter.GetCurrentProcessMetrics 0x7FF980CC81DC Datadog.Trace.dll!Datadog.Trace.RuntimeMetrics.ProcessSnapshotRuntimeInformation.GetCurrentProcessMetrics 0x7FF980CC7ECD Datadog.Trace.dll!Datadog.Trace.RuntimeMetrics.RuntimeMetricsWriter.GetCurrentProcessMetrics 0x7FF9817A392C Datadog.Trace.dll!Datadog.Trace.RuntimeMetrics.RuntimeMetricsWriter.PushEvents 0x7FF9DD2110A2 mscorlib.dll!System.Threading.ExecutionContext.RunInternal 0x7FF9DD210F25 mscorlib.dll!System.Threading.ExecutionContext.Run 0x7FF9DD29AC8A mscorlib.dll!System.Threading.TimerQueueTimer.CallCallback 0x7FF9DD29AA7B mscorlib.dll!System.Threading.TimerQueueTimer.Fire 0x7FF9DD2283F8 mscorlib.dll!System.Threading.TimerQueue.FireNextTimers ``` Today, the `RuntimeMetricsWriter` creates a timer that doesn't repeat, and fires `PushEvents()` after the delay. Inside the `PushEvents()` method, after grabbing the metrics, it calculates the required delay (accounting for drift) and reconfigures the timer. This ensures we don't have concurrent collections, and we avoid drift as much as possible. However, as best as we can tell, if the calculated delay required is very small, it looks like this can handle the callback on the same thread, and [we could end up in a recursive situation](https://learn.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-queueuserapc): > When a user-mode APC is queued, the thread is not directed to call the APC function unless it is in an alertable state. After the thread is in an alertable state, the thread handles all pending APCs in first in, first out (FIFO) order, and the wait operation returns WAIT_IO_COMPLETION. A thread enters an alertable state by using SleepEx function, SignalObjectAndWait function, WaitForSingleObjectEx function, WaitForMultipleObjectsEx function, or MsgWaitForMultipleObjectsEx function. > > If an application queues an APC before the thread begins running, the thread begins by calling the APC function. After the thread calls an APC function, it calls the APC functions for all APCs in its APC queue. > > It is possible to sleep or wait for an object within the APC. **If you perform an alertable wait inside an APC, it will recursively dispatch the APCs. This can cause a stack overflow**. ## Implementation details To avoid the possibility of recursion (and all the weirdness that comes with timers), we switch away from the `Timer` approach, and instead use a long-running task (i.e. dedicated thread) and just `Thread.Sleep` for the delay. This is suboptimal because: - We can't interrupt the `Thread.Sleep()`, so we have to sleep for small periods, check if we're disposed and wake up. This will likely cause a small amount of drift (the time to wake up and execute the loop logic). We could account for this by setting a "deadline" for `DateTime.UtcNow` instead of the "duration requirement". Both have pros and cons. - It uses another dedicated thread (not from the thread pool) We _could_ switch to making things async instead, and use the `ThreadPool` but it's not clear whether all the kernel calls are a bit time consuming for a threadpool thread 🤷‍♂️ Finally, we've only seen this issue in .NET Framework, so to try to minimize the blast radius, this PR limits the change to .NET Framework. ## Test coverage Should be covered by existing, as it's an internal refactoring --------- Co-authored-by: NachoEchevarria <53266532+NachoEchevarria@users.noreply.github.com>
1 parent 309a8b7 commit 03b420d

File tree

1 file changed

+51
-4
lines changed

1 file changed

+51
-4
lines changed

tracer/src/Datadog.Trace/RuntimeMetrics/RuntimeMetricsWriter.cs

Lines changed: 51 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -9,6 +9,7 @@
99
using System.Runtime.CompilerServices;
1010
using System.Runtime.ExceptionServices;
1111
using System.Threading;
12+
using System.Threading.Tasks;
1213
using Datadog.Trace.DogStatsd;
1314
using Datadog.Trace.Logging;
1415
using Datadog.Trace.Vendors.StatsdClient;
@@ -40,8 +41,11 @@ internal class RuntimeMetricsWriter : IDisposable
4041

4142
private readonly TimeSpan _delay;
4243

44+
#if NETFRAMEWORK
45+
private readonly Task _pushEventsTask;
46+
#else
4347
private readonly Timer _timer;
44-
48+
#endif
4549
private readonly IRuntimeMetricsListener _listener;
4650

4751
private readonly bool _enableProcessMetrics;
@@ -122,7 +126,14 @@ internal RuntimeMetricsWriter(IStatsdManager statsd, TimeSpan delay, bool inAzur
122126
Log.Warning(ex, "Unable to initialize runtime listener, some runtime metrics will be missing");
123127
}
124128

129+
#if NETFRAMEWORK
130+
// This delay is set to infinite in tests, so don't start the loop in that case
131+
_pushEventsTask = delay != Timeout.InfiniteTimeSpan
132+
? Task.Factory.StartNew(PushEventsLoop, TaskCreationOptions.LongRunning)
133+
: Task.CompletedTask;
134+
#else
125135
_timer = new Timer(_ => PushEvents(), null, delay, Timeout.InfiniteTimeSpan);
136+
#endif
126137
}
127138

128139
/// <summary>
@@ -139,6 +150,12 @@ public void Dispose()
139150
}
140151

141152
Log.Debug("Disposing Runtime Metrics timer");
153+
#if NETFRAMEWORK
154+
if (!_pushEventsTask.Wait(TimeSpan.FromMilliseconds(5_000)))
155+
{
156+
Log.Warning("Failed to dispose Runtime Metrics timer after 5 seconds");
157+
}
158+
#else
142159
// Callbacks can occur after the Dispose() method overload has been called,
143160
// because the timer queues callbacks for execution by thread pool threads.
144161
// Using the Dispose(WaitHandle) method overload to waits until all callbacks have completed.
@@ -150,7 +167,7 @@ public void Dispose()
150167
Log.Warning("Failed to dispose Runtime Metrics timer after 5 seconds");
151168
}
152169
}
153-
170+
#endif
154171
Log.Debug("Disposing other resources for Runtime Metrics");
155172
AppDomain.CurrentDomain.FirstChanceException -= FirstChanceException;
156173
// We don't dispose runtime metrics on .NET Core because of https://github.com/dotnet/runtime/issues/103480
@@ -160,12 +177,21 @@ public void Dispose()
160177
_exceptionCounts.Clear();
161178
}
162179

163-
internal void PushEvents()
180+
#if NETFRAMEWORK
181+
internal void PushEventsLoop()
182+
{
183+
while (PushEvents())
184+
{
185+
}
186+
}
187+
#endif
188+
189+
internal bool PushEvents()
164190
{
165191
if (Volatile.Read(ref _disposed) == 1)
166192
{
167193
Log.Debug("Runtime metrics is disposed and can't push new events");
168-
return;
194+
return false;
169195
}
170196

171197
var now = DateTime.UtcNow;
@@ -259,6 +285,24 @@ internal void PushEvents()
259285
{
260286
var callbackExecutionDuration = DateTime.UtcNow - now;
261287

288+
#if NETFRAMEWORK
289+
// Ideally we'd wait for the full time, but we need to make sure we shutdown in a relatively timely fashion
290+
const int loopDurationMs = 200;
291+
var newDelay = (int)(_delay - callbackExecutionDuration).TotalMilliseconds;
292+
293+
// Missed it, so just reset
294+
if (newDelay <= 0)
295+
{
296+
newDelay = (int)_delay.TotalMilliseconds;
297+
}
298+
299+
while (newDelay > 0 && Volatile.Read(ref _disposed) == 0)
300+
{
301+
var sleepDuration = Math.Min(newDelay, loopDurationMs);
302+
Thread.Sleep(sleepDuration);
303+
newDelay -= sleepDuration;
304+
}
305+
#else
262306
var newDelay = _delay - callbackExecutionDuration;
263307

264308
if (newDelay < TimeSpan.Zero)
@@ -273,7 +317,10 @@ internal void PushEvents()
273317
catch (ObjectDisposedException)
274318
{
275319
}
320+
#endif
276321
}
322+
323+
return true;
277324
}
278325

279326
private static IRuntimeMetricsListener InitializeListener(IStatsdManager statsd, TimeSpan delay, bool inAzureAppServiceContext)

0 commit comments

Comments
 (0)