Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NO-TICKET] Fix profiler flaky test #3806

Merged
merged 2 commits into from
Jul 25, 2024
Merged

Conversation

ivoanjo
Copy link
Member

@ivoanjo ivoanjo commented Jul 24, 2024

What does this PR do?

This PR fixes a flaky test introduced by #3792:

Failures:

  1) Datadog::Profiling::Collectors::CpuAndWallTimeWorker#start when allocation profiling is enabled with dynamic_sampling_rate_enabled records skipped allocation samples when weights are clamped
     Failure/Error: raise('Wait time exhausted!')
     
     RuntimeError:
       Wait time exhausted!
     # ./spec/support/synchronization_helpers.rb:97:in `try_wait_until'
     # ./spec/datadog/profiling/collectors/cpu_and_wall_time_worker_spec.rb:554:in `block (5 levels) in <top (required)>'
     # ./spec/spec_helper.rb:230:in `block (2 levels) in <top (required)>'
     # ./spec/spec_helper.rb:115:in `block (2 levels) in <top (required)>'
     # /usr/local/bundle/gems/webmock-3.13.0/lib/webmock/rspec.rb:37:in `block (2 levels) in <top (required)>'
     # /usr/local/bundle/gems/rspec-wait-0.0.9/lib/rspec/wait.rb:46:in `block (2 levels) in <top (required)>'

  2) Datadog::Profiling::Collectors::CpuAndWallTimeWorker#start when all threads are sleeping (no thread holds the Global VM Lock) is able to sample even when all threads are sleeping
     Failure/Error:
       expect(simulated_signal_delivery.to_f / trigger_sample_attempts).to (be >= 0.8), \
         "Expected at least 80% of signals to be simulated, stats: #{stats}, debug_failures: #{debug_failures}"
     
       Expected at least 80% of signals to be simulated, stats: {:trigger_sample_attempts=>60, :trigger_simulated_signal_delivery_attempts=>0, :simulated_signal_delivery=>0, :signal_handler_enqueued_sample=>60, :signal_handler_wrong_thread=>0, :postponed_job_skipped_already_existed=>0, :postponed_job_success=>60, :postponed_job_full=>0, :postponed_job_unknown_result=>0, :interrupt_thread_attempts=>60, :cpu_sampled=>60, :cpu_skipped=>0, :cpu_effective_sample_rate=>1.0, :cpu_sampling_time_ns_min=>78969, :cpu_sampling_time_ns_max=>141799, :cpu_sampling_time_ns_total=>5846823, :cpu_sampling_time_ns_avg=>97447.05, :allocation_sampled=>nil, :allocation_skipped=>nil, :allocation_effective_sample_rate=>nil, :allocation_sampling_time_ns_min=>nil, :allocation_sampling_time_ns_max=>nil, :allocation_sampling_time_ns_total=>nil, :allocation_sampling_time_ns_avg=>nil, :allocation_sampler_snapshot=>nil, :allocations_during_sample=>nil}, debug_failures: {(omitted)}
     # ./spec/datadog/profiling/collectors/cpu_and_wall_time_worker_spec.rb:397:in `block (4 levels) in <top (required)>'
     # ./spec/spec_helper.rb:230:in `block (2 levels) in <top (required)>'
     # ./spec/spec_helper.rb:115:in `block (2 levels) in <top (required)>'
     # /usr/local/bundle/gems/webmock-3.13.0/lib/webmock/rspec.rb:37:in `block (2 levels) in <top (required)>'
     # /usr/local/bundle/gems/rspec-wait-0.0.9/lib/rspec/wait.rb:46:in `block (2 levels) in <top (required)>'

Finished in 8 minutes 55 seconds (files took 1.3 seconds to load)
758 examples, 2 failures, 19 pending

Failed examples:

rspec ./spec/datadog/profiling/collectors/cpu_and_wall_time_worker_spec.rb:549 # Datadog::Profiling::Collectors::CpuAndWallTimeWorker#start when allocation profiling is enabled with dynamic_sampling_rate_enabled records skipped allocation samples when weights are clamped
rspec ./spec/datadog/profiling/collectors/cpu_and_wall_time_worker_spec.rb:379 # Datadog::Profiling::Collectors::CpuAndWallTimeWorker#start when all threads are sleeping (no thread holds the Global VM Lock) is able to sample even when all threads are sleeping

Specifically, the first test failed due to 4d7cbce -- the profiler was able to "keep up" with the allocation rate of the Ruby VM and never had to skip samples.

The second test then failed because the first test leaked the thread_that_allocates_as_fast_as_possible. For the second test, we require that the Ruby VM is fully idle for a certain period, and the leaked thread never allowed the Ruby VM to be idle.

The debug output from the second test (this test already has a bunch of extra custom output on failure since it has flaked in the past, although not always by its fault) showed this:

#<struct ProfileHelpers::Sample locations=[
#<struct ProfileHelpers::Frame base_label="initialize", path="/app/spec/datadog/profiling/collectors/cpu_and_wall_time_worker_spec.rb", lineno=552>, 
...
], values={:"cpu-samples"=>55, :"wall-time"=>549571399, :"cpu-time"=>470431681, :"alloc-samples"=>0, :"alloc-samples-unscaled"=>0},
labels={:"thread id"=>"1888 (34120)", :"thread name"=>"/app/spec/spec_helper.rb:245", :state=>"had cpu"}>,

E.g. you can see that the thread started by the other spec was hot on the cpu (lots of cpu-time) and that's why the Ruby VM was not idle.

Motivation:

The profiler aims for zero flaky tests always! If you find a flaky test in the profiler, I don't know about it, please do let me know :)

Additional Notes:

Interestingly, our leaked threads detector did flag this issue:

      with dynamic_sampling_rate_enabled

Spec leaked 1 threads in "Datadog::Profiling::Collectors::CpuAndWallTimeWorker#start when allocation profiling is enabled with dynamic_sampling_rate_enabled records skipped allocation samples when weights are clamped".
Ensure all threads are terminated when test finishes.
For help fixing this issue, see "Ensuring tests don't leak resources" in docs/DevelopmentGuide.md.

...I somewhat wish we were already 100% clean of leaked threads so that we could make a thread leak an instant test suite failure.

How to test the change?

Validate that CI is beautiful green.

… testcase fails

This test failed in CI, which led other tests to start failing, as well
as making the overall test run really slow until it dragged itself over
the line.

To avoid this sharp edge, let's make sure to stop the background thread
even if the spec fails.
This test for the "Skipped Samples" placeholder failed in CI:
<https://app.circleci.com/pipelines/github/DataDog/dd-trace-rb/15605/workflows/5b32e4bb-155e-4a94-bf8b-980599eaafae/jobs/568892/tests#failed-test-0>

I strongly suspect the issue is that the profiler was able to "keep up"
with the Ruby VM and thus never needed to skip samples.

In fact, I can reproduce the same behavior by raising the
`dynamic_sampling_rate_overhead_target_percentage` on my machine.

Thus, to reduce flakiness, I'm lowering the target for the dynamic
sampling rate from 2% to 0.1% which should be tight enough to
consistently cause the profiler to need to skip samples in this test.
@ivoanjo ivoanjo requested a review from a team as a code owner July 24, 2024 11:37
@github-actions github-actions bot added the dev/testing Involves testing processes (e.g. RSpec) label Jul 24, 2024
@codecov-commenter
Copy link

Codecov Report

All modified and coverable lines are covered by tests ✅

Please upload report for BASE (master@9a7002c). Learn more about missing BASE report.
Report is 2 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##             master    #3806   +/-   ##
=========================================
  Coverage          ?   97.90%           
=========================================
  Files             ?     1261           
  Lines             ?    75620           
  Branches          ?     3706           
=========================================
  Hits              ?    74038           
  Misses            ?     1582           
  Partials          ?        0           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ivoanjo ivoanjo merged commit f5757c2 into master Jul 25, 2024
171 checks passed
@ivoanjo ivoanjo deleted the ivoanjo/fix-profiler-flaky-spec branch July 25, 2024 08:02
@github-actions github-actions bot added this to the 2.3.0 milestone Jul 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dev/testing Involves testing processes (e.g. RSpec)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants