Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PROF-10422] Fix flaky GVL profiling integration spec #3936

Merged
merged 1 commit into from
Sep 20, 2024

Conversation

ivoanjo
Copy link
Member

@ivoanjo ivoanjo commented Sep 20, 2024

What does this PR do?

This PR fixes a flaky specs seen in
https://app.circleci.com/pipelines/github/DataDog/dd-trace-rb/16487/workflows/df69cfde-1fd5-40dc-bff6-ac8a9d0bc713/jobs/593534 :

  1) Datadog::Profiling::Collectors::CpuAndWallTimeWorker#start when main thread is sleeping but a background thread is working when GVL profiling is enabled records Waiting for GVL samples
     Failure/Error: expect(waiting_for_gvl_time).to be_within(5).percent_of(total_time)
       expected 402832467 to be within 5% of 492191168

I was able to reproduce this failure on my local machine in around 1 in 20 runs.

This flakiness is caused by the profiler starting after a Waiting for GVL already started, and thus the profiler does not see it and still categorizes that period as "unknown"; thus our tight assertion is blown.

It's not a coincidence that the difference between 492191168 and 402832467 is within ~100ms: that's the scheduler latency caused by the background thread that's burning CPU.

To fix this issue, we subtract the duration of this initial period from the total_time: this is the period we care for.

To avoid us accidentally passing this test by having no Waiting for GVL, I've added an additional assertion for a sane total_value.

Motivation:

Make sure the profiler has zero flaky specs.

Additional Notes:

This PR is on top of #3934 for easier testing, but it's otherwise independent of it.

How to test the change?

Validate that CI is green and that the flaky test does not show up again.

**What does this PR do?**

This PR fixes a flaky specs seen in
https://app.circleci.com/pipelines/github/DataDog/dd-trace-rb/16487/workflows/df69cfde-1fd5-40dc-bff6-ac8a9d0bc713/jobs/593534 :

```
  1) Datadog::Profiling::Collectors::CpuAndWallTimeWorker#start when main thread is sleeping but a background thread is working when GVL profiling is enabled records Waiting for GVL samples
     Failure/Error: expect(waiting_for_gvl_time).to be_within(5).percent_of(total_time)
       expected 402832467 to be within 5% of 492191168
```

I was able to reproduce this failure on my local machine in around 1 in
20 runs.

This flakiness is caused by the profiler starting after a Waiting for
GVL already started, and thus the profiler does not see it and still
categorizes that period as "unknown"; thus our tight assertion is blown.

It's not a concidence that the difference between 492191168 and
402832467 is within ~100ms: that's the scheduler latency caused
by the background thread that's burning CPU.

To fix this issue, we subtract the duration of this initial period
from the `total_time`: this is the period we care for.

To avoid us accidentally passing this test by having no Waiting for GVL,
I've added an additional assertion for a sane `total_value`.

**Motivation:**

Make sure the profiler has zero flaky specs.

**Additional Notes:**

N/A

**How to test the change?**

Validate that CI is green and that the flaky test does not show up
again.
@ivoanjo ivoanjo requested a review from a team as a code owner September 20, 2024 11:48
@codecov-commenter
Copy link

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.77%. Comparing base (c8d5351) to head (be8882a).

Additional details and impacted files
@@                             Coverage Diff                              @@
##           ivoanjo/fix-gvl-profling-exception-flaky    #3936      +/-   ##
============================================================================
- Coverage                                     97.77%   97.77%   -0.01%     
============================================================================
  Files                                          1297     1297              
  Lines                                         77973    77975       +2     
  Branches                                       3888     3888              
============================================================================
  Hits                                          76240    76240              
- Misses                                         1733     1735       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@pr-commenter
Copy link

pr-commenter bot commented Sep 20, 2024

Benchmarks

Benchmark execution time: 2024-09-20 12:23:44

Comparing candidate commit be8882a in PR branch ivoanjo/prof-10422-gvl-flaky-period with baseline commit c8d5351 in branch ivoanjo/fix-gvl-profling-exception-flaky.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 23 metrics, 2 unstable metrics.

Base automatically changed from ivoanjo/fix-gvl-profling-exception-flaky to master September 20, 2024 13:50
@ivoanjo ivoanjo merged commit a91f0c4 into master Sep 20, 2024
191 checks passed
@ivoanjo ivoanjo deleted the ivoanjo/prof-10422-gvl-flaky-period branch September 20, 2024 13:51
@github-actions github-actions bot added this to the 2.4.0 milestone Sep 20, 2024
@ivoanjo ivoanjo added profiling Involves Datadog profiling dev/internal Other internal work that does not need to be included in the changelog dev/testing Involves testing processes (e.g. RSpec) labels Sep 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dev/internal Other internal work that does not need to be included in the changelog dev/testing Involves testing processes (e.g. RSpec) profiling Involves Datadog profiling
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants