[PROF-10123] Add placeholder for skipped samples in allocation profiler #3760
Conversation
…ofiler

**What does this PR do?**

This PR fixes a corner-case issue with the allocation profiler: if too many objects get allocated, we "clamp" the weight of every individual sample to a maximum value (to avoid introducing bias). But we previously did not represent the clamping on the profile itself in any way, which meant that the total number of allocations reported by the profiler would be missing any object counts that were clamped.

In practice, we don't expect this to happen very often, as we have a high clamping value `MAX_ALLOC_WEIGHT = 65535` (it's in `collectors_cpu_and_wall_time_worker.c`).

This PR introduces an extra placeholder sample, when needed, to represent allocations that went over the clamping value. Thus the total number of objects allocated reported by the profiler is now accurate, even in the presence of clamping.

**Motivation:**

We're considering lowering the maximum clamping value, which would mean it would be hit more often, so it's important to fix this issue before we do that.

**Additional Notes:**

N/A

**How to test the change?**

I've added test coverage for this change. I also used the following snippet to quickly experiment while I was developing this PR:

```ruby
Datadog::Profiling.wait_until_running
allocator = Thread.new { loop { BasicObject.new } }
sleep 10
allocator.kill; allocator.join
puts "Allocated #{GC.stat(:total_allocated_objects)} objects total"
```
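To make the bookkeeping concrete, here's a minimal standalone sketch of the idea, not the profiler's actual C implementation: per-sample weights get clamped to `MAX_ALLOC_WEIGHT`, the clamped-away remainder is accumulated, and one placeholder sample is emitted at serialization time so the totals add up. All names besides `MAX_ALLOC_WEIGHT` are made up for illustration.

```ruby
MAX_ALLOC_WEIGHT = 65_535  # clamping value from the PR description

samples = []               # [stack, weight] pairs actually recorded
skipped_weight = 0         # allocation counts dropped by clamping

# Record an allocation sample, clamping its weight and remembering
# how much weight was clamped away.
record = lambda do |stack, weight|
  clamped = [weight, MAX_ALLOC_WEIGHT].min
  samples << [stack, clamped]
  skipped_weight += weight - clamped
end

record.call("app.rb:1", 100_000)  # over the limit: only 65_535 recorded
record.call("app.rb:2", 1_000)    # under the limit: recorded unchanged

# When serializing, emit one placeholder sample covering the clamped
# weight, so the reported total matches the true allocation count.
samples << ["Skipped Samples", skipped_weight] if skipped_weight > 0

total = samples.sum { |(_, weight)| weight }
puts total  # => 101_000, the true allocation count
```

Without the placeholder, the reported total would be 66,535 — missing the 34,465 allocations lost to clamping.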
The "missing" allocations naming is confusing to me. What do you think about "skipped" instead?
Thanks for the suggestion! I've replaced "Missing Allocations" with "Skipped Samples".
…name

This was a great suggestion from the PR review.
We've seen this in a few other places in the codebase, so let's use the usual workaround.
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@ Coverage Diff @@
## master #3760 +/- ##
========================================
Coverage 97.91% 97.91%
========================================
Files 1241 1241
Lines 74632 74752 +120
Branches 3605 3607 +2
========================================
+ Hits 73074 73196 +122
+ Misses 1558 1556 -2

View full report in Codecov by Sentry.
…ises

Due to a bug in the integration between how the heap sampler works and the new "missing allocations/skipped samples" placeholder, we could cause one of the heap sampler sanity checks to fail, and thus raise an exception. This exception would be correctly logged, e.g. something like:

> W, [2024-07-08T11:24:21.212654 #226523] WARN -- datadog: [datadog]
> (lib/datadog/profiling/collectors/cpu_and_wall_time_worker.rb:72:in
> `rescue in block (2 levels) in start') CpuAndWallTimeWorker thread
> error. Cause: RuntimeError Ended a heap recording that was not
> started Location: missing-testcase.rb:2:in `new'

The exception would also cause the profiler to try to shut down. But this is where we would run into issues: the scheduler still tries to report one last profile before shutting down, and because the above exception left the stack recorder with the active slot mutex locked, the scheduler would deadlock forever, thus preventing the Ruby VM from shutting down.

Part 1 of fixing this issue is to correctly unlock the mutex, thus getting rid of the deadlock. Part 2 (next commit) will be to not report a profile when the profiler stopped with a failure. This seems like a prudent way to go, since the profile may be in an inconsistent state.

(Out of curiosity: either 1 or 2 alone would suffice to avoid the issue, but both seem useful individually to avoid future issues.)
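The unlock fix can be sketched in plain Ruby (the real change is in the C stack recorder; `active_slot_mutex` and the lambda here are hypothetical stand-ins): releasing the lock in an `ensure` block guarantees it is released even when a sanity check raises, so a later serialization attempt cannot deadlock.

```ruby
active_slot_mutex = Mutex.new

# Sketch of the problematic path: before the fix, a raise between lock
# and unlock left the mutex held forever. With ensure, the mutex is
# released even when the sanity check fails.
sample_allocation = lambda do |fail_sanity_check:|
  active_slot_mutex.lock
  begin
    raise "Ended a heap recording that was not started" if fail_sanity_check
  ensure
    active_slot_mutex.unlock  # Part 1 of the fix: always unlock
  end
end

begin
  sample_allocation.call(fail_sanity_check: true)
rescue RuntimeError => e
  puts "worker error: #{e.message}"
end

# The scheduler's final report can now acquire the lock instead of
# deadlocking on shutdown.
active_slot_mutex.synchronize { puts "serialized last profile" }
```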
When the worker component fails, we also stop the scheduler. But, up until now, if there was enough data to be reported, we would still try to serialize and report a last profile. This is not an entirely great idea, as the profile may not be in a consistent state. So, instead, this commit adds a flag to mark this situation and the scheduler will use it to skip any attempts to report data.
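A sketch of that flag in plain Ruby — the class and method names (`Scheduler`, `mark_profiler_failed`, `flush_on_shutdown`) are illustrative, not the library's actual API:

```ruby
# Hypothetical scheduler that skips its final report after a failure.
class Scheduler
  def initialize
    @failure_detected = false
  end

  # Called when the worker component detects an unrecoverable error.
  def mark_profiler_failed
    @failure_detected = true
  end

  # Final flush on shutdown: skip reporting when the profile may be
  # in an inconsistent state.
  def flush_on_shutdown
    @failure_detected ? :skipped : :reported
  end
end

healthy_scheduler = Scheduler.new
failed_scheduler = Scheduler.new
failed_scheduler.mark_profiler_failed

puts healthy_scheduler.flush_on_shutdown  # => reported
puts failed_scheduler.flush_on_shutdown   # => skipped
```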
Update: I've discovered this PR creates an incompatibility with the heap profiler. We didn't have tests for this interaction, so I'm working to fix the bug + add tests to make sure we have coverage for it.
…eap profiler

I missed it in my original commits, but the placeholder for skipped samples broke the heap profiler, which started failing with:

> W, [2024-07-08T15:41:27.128532 #366230] WARN -- datadog: [datadog]
> CpuAndWallTimeWorker thread error. Cause: RuntimeError Ended a heap
> recording that was not started Location:
> missing-testcase.rb:2:in `new'
> W, [2024-07-08T15:41:27.128583 #366230] WARN -- datadog: [datadog]
> Detected issue with profiler (worker component), stopping profiling.
> See previous log messages for details.

This is triggered by how our current heap profiler integrates with the rest of the profiler: it expects to be called twice for each allocation sample -- once before the stack is collected, and once after. The implementation of the placeholder for skipped samples correctly did not call the heap profiler before the placeholder stack got collected, BUT we were still calling the heap profiler after. This caused its internal consistency checks to fail.

The fix is to not call the heap profiler after the placeholder sample. I chose to add a new argument to `record_sample`, which is not particularly pretty, but until we come up with a better way to have the heap profiler integrate with the rest of the components, it'll do.
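The two-phase contract and the failure mode can be sketched in Ruby — `HeapRecorder` and the `heap_sample:` keyword are illustrative stand-ins for the C-level `record_sample` change, not the actual implementation:

```ruby
# Hypothetical heap recorder enforcing a start/end pairing sanity check,
# mirroring the "Ended a heap recording that was not started" error.
class HeapRecorder
  def initialize
    @recording = false
  end

  def start_recording
    raise "Started a heap recording that was not ended" if @recording
    @recording = true
  end

  def end_recording
    raise "Ended a heap recording that was not started" unless @recording
    @recording = false
  end
end

heap = HeapRecorder.new

# The fix: placeholder samples skip BOTH heap profiler phases, keeping
# start/end calls balanced.
record_sample = lambda do |heap_sample:|
  heap.start_recording if heap_sample  # phase 1: before stack collection
  # ... collect the stack ...
  heap.end_recording if heap_sample    # phase 2: after stack collection
end

record_sample.call(heap_sample: true)   # regular allocation sample: ok
record_sample.call(heap_sample: false)  # placeholder sample: both phases skipped
```

The bug was equivalent to calling only phase 2 for a placeholder sample, which trips the `end_recording` check.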
Oooops! This spec definitely should not be running on legacy Rubies. It actually started to flake with a recent change, and it took me a while to actually remember that it shouldn't even be running in the first place.
I saw some flakiness in CI, so this is a very soft attempt at avoiding it; let's see if it's enough or we'll need more changes.
…f-10123-placeholder-missing-allocations"

**What does this PR do?**

This PR reverts the change from #3760. While we still want to add this feature, the implementation in #3760 can cause the heap profiler to fail with an exception. This failure also uncovered a few other potential bugs in our error handling code. Thus, to be on the safe side, and because we want to get a new release out, I'm reverting the whole PR until I can re-open a fixed version.

**Motivation:**

Remove blockers from the 2.2.0 release.

**Additional Notes:**

This reverts commit 7732142, reversing changes made to ca006e9.

**How to test the change?**

Validate that CI is still green.
[PROF-3760] Revert "Merge pull request #3760 from DataDog/ivoanjo/prof-10123-placeholder-missing-allocations"
* master: (98 commits)
  - DEBUG-2334 Remove old ddtrace require from tracing benchmarks (DataDog#3783)
  - Avoid environment starts with number
  - Add more variables for artifacts unpack
  - `git fetch` before rev-parse
  - Use ruby image that has `git` installed
  - Provide versions
  - Fixed: RSpec #log_deprecation matcher error when log is limited.
  - Remove internal entry from changelog
  - Update lockfiles for release 2.2.0
  - Bump version 2.1.0 to 2.2.0
  - Add 2.2.0 to CHANGELOG.md
  - Add VSCode extension recommendations
  - Add `ruby-lsp` gem
  - [PROF-3760] Revert "Merge pull request DataDog#3760 from DataDog/ivoanjo/prof-10123-placeholder-missing-allocations"
  - [PROF-10112] Also simplify actionview templates with three underscores
  - remove duplicated comments
  - Use ActiveSupport events to instrument ActiveSupport::Cache
  - [PROF-10125] Track unscaled allocation counts in allocation profiler
  - Bootstrap new profile value type for unscaled counts
  - apply static analysis proposed fixes
  - ...
What does this PR do?
This PR fixes a corner case issue with the allocation profiler: if too many objects get allocated, we "clamp" the weight of every individual sample (to avoid introducing bias) to a maximum value.
But we previously did not in any way represent the clamping on the profile itself, which meant that the total number of allocations reported by the profiler would be missing any object counts that were clamped.
In practice, we don't expect this to happen very often, as we have a high clamping value `MAX_ALLOC_WEIGHT = 65535` (it's in `collectors_cpu_and_wall_time_worker.c`).

This PR then introduces an extra placeholder sample, when needed, to represent allocations that went over the clamping value.
Thus the total number of objects allocated reported by the profiler is now accurate, even in the presence of clamping.
Motivation:
We're considering lowering the maximum clamping value, which would mean it would be hit more often, so it's important to fix this issue before we do that.
Additional Notes:
N/A
How to test the change?
I've added test coverage for this change. I also used the following snippet to quickly experiment while I was developing this PR:
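The snippet itself didn't survive the page extraction at this point, so here is a standalone variant that runs without the profiler (no `Datadog::Profiling` dependency, and a bounded allocation loop instead of `sleep 10`), exercising the same VM-wide allocation counter:

```ruby
# Standalone variant of the test snippet: allocate a known number of
# objects on a background thread and read Ruby's VM-wide counter.
before = GC.stat(:total_allocated_objects)

allocator = Thread.new { 100_000.times { BasicObject.new } }
allocator.join

allocated = GC.stat(:total_allocated_objects) - before
puts "Allocated #{allocated} objects total"
```

Note that `GC.stat(:total_allocated_objects)` counts every allocation in the VM, which is exactly why clamped (unrepresented) samples made the profiler's total visibly disagree with it.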
Here's how it looks (it's the orange frame on the left):

[screenshot: flamegraph showing the "Skipped Samples" placeholder frame]

You can observe that in this testcase, with a massive number of allocations, we would be missing around 10% of allocation counts, which we now track correctly.