From be8882ace3312786f021736820ddbe20322d6411 Mon Sep 17 00:00:00 2001 From: Ivo Anjo Date: Fri, 20 Sep 2024 12:40:35 +0100 Subject: [PATCH] [PROF-10422] Fix flaky GVL profiling integration spec **What does this PR do?** This PR fixes a flaky specs seen in https://app.circleci.com/pipelines/github/DataDog/dd-trace-rb/16487/workflows/df69cfde-1fd5-40dc-bff6-ac8a9d0bc713/jobs/593534 : ``` 1) Datadog::Profiling::Collectors::CpuAndWallTimeWorker#start when main thread is sleeping but a background thread is working when GVL profiling is enabled records Waiting for GVL samples Failure/Error: expect(waiting_for_gvl_time).to be_within(5).percent_of(total_time) expected 402832467 to be within 5% of 492191168 ``` I was able to reproduce this failure on my local machine in around 1 in 20 runs. This flakiness is caused by the profiler starting after a Waiting for GVL already started, and thus the profiler does not see it and still categorizes that period as "unknown"; thus our tight assertion is blown. It's not a concidence that the difference between 492191168 and 402832467 is within ~100ms: that's the scheduler latency caused by the background thread that's burning CPU. To fix this issue, we subtract the duration of this initial period from the `total_time`: this is the period we care for. To avoid us accidentally passing this test by having no Waiting for GVL, I've added an additional assertion for a sane `total_value`. **Motivation:** Make sure the profiler has zero flaky specs. **Additional Notes:** N/A **How to test the change?** Validate that CI is green and that the flaky test does not show up again. --- .../collectors/cpu_and_wall_time_worker_spec.rb | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/spec/datadog/profiling/collectors/cpu_and_wall_time_worker_spec.rb b/spec/datadog/profiling/collectors/cpu_and_wall_time_worker_spec.rb index b93765fbec..bd51b87dd2 100644 --- a/spec/datadog/profiling/collectors/cpu_and_wall_time_worker_spec.rb +++ b/spec/datadog/profiling/collectors/cpu_and_wall_time_worker_spec.rb @@ -449,16 +449,25 @@ samples = samples_for_thread( samples_from_pprof_without_gc_and_overhead(recorder.serialize!), background_thread_affected_by_gvl_contention - ) + ).sort_by { |s| s.labels.fetch(:end_timestamp_ns) } + + # Because the background_thread_affected_by_gvl_contention starts BEFORE the profiler, the first few samples + # will have an unknown state because the profiler may have missed the beginning of the Waiting for GVL + # + # So that the below assertions make sense (and are not flaky), we drop these first few samples from our + # consideration + missed_by_profiler_time = + samples.take_while { |s| s.labels[:state] == "unknown" }.sum { |sample| sample.values.fetch(:"wall-time") } waiting_for_gvl_samples = samples.select { |sample| sample.labels[:state] == "waiting for gvl" } - total_time = samples.sum { |sample| sample.values.fetch(:"wall-time") } + total_time = samples.sum { |sample| sample.values.fetch(:"wall-time") } - missed_by_profiler_time waiting_for_gvl_time = waiting_for_gvl_samples.sum { |sample| sample.values.fetch(:"wall-time") } expect(waiting_for_gvl_samples.size).to be > 0 # The background thread should spend almost all of its time waiting to run (since when it gets to run # it just passes and starts waiting) + expect(total_time).to be >= 200_000_000 # This test should run for at least 200ms, which is how long we sleep for expect(waiting_for_gvl_time).to be < total_time expect(waiting_for_gvl_time).to be_within(5).percent_of(total_time)