Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PROF-10123] Add placeholder for skipped samples in allocation profiler #3760

Merged
merged 9 commits into from
Jul 9, 2024

Conversation

ivoanjo
Copy link
Member

@ivoanjo ivoanjo commented Jul 5, 2024

What does this PR do?

This PR fixes a corner case issue with the allocation profiler: if too many objects get allocated, we "clamp" the weight of every individual sample (to avoid introducing bias) to a maximum value.

But we previously did not in any way represent the clamping on the profile itself, which meant that the total number of allocations reported by the profiler would be missing any object counts that were clamped.

In practice, we don't expect this to happen very often, as we have a high clamping value MAX_ALLOC_WEIGHT = 65535 (it's in collectors_cpu_and_wall_time_worker.c).

This PR then introduces an extra placeholder sample, when needed, to represent allocations that went over the clamped value.

Thus the total amount of objects allocated reported by the profiler now becomes accurate, even in the presence of clamping.

Motivation:

We're considering lowering the maximum clamping value, which would mean it would be hit more often, and thus it's important to fix this issue before we do that.

Additional Notes:

N/A

How to test the change?

I've added test coverage for this change. I also used the following snippet to quickly experiment while I was developing this PR:

Datadog::Profiling.wait_until_running
allocator = Thread.new { loop { BasicObject.new } }
sleep 10
allocator.kill; allocator.join
puts "Allocated #{GC.stat(:total_allocated_objects)} objects total"

Here's how it looks (it's the orange frame on the left):

image

Update with new placeholder name:
image

You can observe that in this testcase with a massive amount of allocations we would be missing around ~10% of allocation counts, which we now track correctly.

…ofiler

**What does this PR do?**

This PR fixes a corner case issue with the allocation profiler: if too
many objects get allocated, we "clamp" the weight of every individual
sample (to avoid introducing bias) to a maximum value.

But we previously did not in any way represent the clamping on the
profile itself, which meant that the total number of allocations
reported by the profiler would be missing any object counts that
were clamped.

In practice, we don't expect this to happen very often, as we have
a high clamping value `MAX_ALLOC_WEIGHT = 65535` (it's in
`collectors_cpu_and_wall_time_worker.c`).

This PR then introduces an extra placeholder sample, when needed,
to represent allocations that went over the clamped value.

Thus the total amount of objects allocated reported by the profiler
now becomes accurate, even in the presence of clamping.

**Motivation:**

We're considering lowering the maximum clamping value, which would mean
it would be hit more often, and thus it's important to fix this issue
before we do that.

**Additional Notes:**

N/A

**How to test the change?**

I've added test coverage for this change. I also used the following
snippet to quickly experiment while I was developing this PR:

```ruby
Datadog::Profiling.wait_until_running
allocator = Thread.new { loop { BasicObject.new } }
sleep 10
allocator.kill; allocator.join
puts "Allocated #{GC.stat(:total_allocated_objects)} objects total"
```
@ivoanjo ivoanjo requested a review from a team as a code owner July 5, 2024 16:13
@github-actions github-actions bot added the profiling Involves Datadog profiling label Jul 5, 2024
Copy link
Contributor

@p-datadog p-datadog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "missing" allocations to me is confusing. What do you think about "skipped" instead?

@ivoanjo ivoanjo changed the title [PROF-10123] Add placeholder for missing allocations in allocation profiler [PROF-10123] Add placeholder for skipped samples in allocation profiler Jul 8, 2024
@ivoanjo
Copy link
Member Author

ivoanjo commented Jul 8, 2024

Thanks for the suggestion! I've replaced "Missing Allocations" with "Skipped Samples".

…name

This was a great suggestion from the PR review.
We've seen this in a few other places in the codebase, so let's use
the usual workaround.
@codecov-commenter
Copy link

codecov-commenter commented Jul 8, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.91%. Comparing base (849c60c) to head (abdb523).
Report is 24 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff            @@
##           master    #3760    +/-   ##
========================================
  Coverage   97.91%   97.91%            
========================================
  Files        1241     1241            
  Lines       74632    74752   +120     
  Branches     3605     3607     +2     
========================================
+ Hits        73074    73196   +122     
+ Misses       1558     1556     -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

…ises

Due to a bug in the integration between how the heap sampler works and
the new "missing allocations/skipped samples" placeholder, we could
cause one of the heap sampler sanity checks to fail, and thus raise
an exception.

This exception would be correctly logged, e.g. something like
> W, [2024-07-08T11:24:21.212654 #226523]  WARN -- datadog: [datadog]
> (lib/datadog/profiling/collectors/cpu_and_wall_time_worker.rb:72:in
> `rescue in block (2 levels) in start') CpuAndWallTimeWorker thread
> error. Cause: RuntimeError Ended a heap recording that was not
> started Location: missing-testcase.rb:2:in `new'

The exception would also cause the profiler to try to shut down.

But this is where we would run into issues: the scheduler still tries
to report one last profile before shutting down, and because the
above exception left the stack recorder with the active slot mutex
locked, this would make the scheduler deadlock forever, thus
preventing the Ruby VM from shutting down.

Part 1 on fixing this issue is to correctly unlock the mutex, thus
getting rid of the deadlock.

Part 2 (next commit) will be to not report a profile when the profiler
stopped with a failure. This seems like a prudent way to go, since the
profile may be in an inconsistent state.

(Out of curiosity, both 1 or 2 would suffice to avoid the issue, but
I think both seem useful individually to avoid future issues.)
When the worker component fails, we also stop the scheduler. But, up
until now, if there was enough data to be reported, we would still try
to serialize and report a last profile.

This is not an entirely great idea, as the profile may not be in a
consistent state. So, instead, this commit adds a flag to mark this
situation and the scheduler will use it to skip any attempts to
report data.
@ivoanjo
Copy link
Member Author

ivoanjo commented Jul 8, 2024

Update: I've discovered this PR creates an incompatibility with the heap profiler. We didn't have tests for this interaction, so I'm working to fix the bug + add tests to make sure we have coverage for it.

…eap profiler

I missed it on my original commits, but the placeholder for skipped
samples broke the heap profiler, as it started failing with:

> W, [2024-07-08T15:41:27.128532 #366230]  WARN -- datadog: [datadog]
> CpuAndWallTimeWorker thread error. Cause: RuntimeError Ended a heap
> recording that was not started Location:
> missing-testcase.rb:2:in `new'
> W, [2024-07-08T15:41:27.128583 #366230]  WARN -- datadog: [datadog]
> Detected issue with profiler (worker component), stopping profiling.
> See previous log messages for details.

This is triggered by how our current heap profiler integrates with
the rest of the profiler: it expects to be called twice for each
allocation sample -- once before the stack is collected, and once
after.

The implementation of the placeholder for skipped samples correctly
did not call the heap profiler before the placeholder stack got
collected BUT we were still calling the heap profiler after.

This caused its internal consistency checks to fail.

The fix for this is to not call the heap profiler after the
placeholder sample. I chose to add a new argument to `record_sample`
which is not particularly pretty, but I guess until we come up with
a better way to have the heap profiler integrate with the rest of
the components, it'll do.
Oooops! This spec definitely should not be running on legacy Rubies.

It actually started to flake with a recent change, and it took me
a while to actually remember that it shouldn't even be running in
the first place.
I saw some flakiness in CI, so this is a very soft attempt at avoiding
it; let's see if it's enough or we'll need more changes.
@ivoanjo ivoanjo merged commit 7732142 into master Jul 9, 2024
170 checks passed
@ivoanjo ivoanjo deleted the ivoanjo/prof-10123-placeholder-missing-allocations branch July 9, 2024 13:44
@github-actions github-actions bot added this to the 2.2.0 milestone Jul 9, 2024
ivoanjo added a commit that referenced this pull request Jul 10, 2024
…f-10123-placeholder-missing-allocations"

**What does this PR do?**

This PR reverts the change from #3760.

While we still want to add this feature, the implementation in #3760
can cause heap profiler to fail with an exception.

This failure also uncovered a few other potential bugs in our error
handling code.

Thus, to be on the safe side, and because we want to get a new release
out, I'm reverting the whole PR, until I can re-open a fixed version.

**Motivation:**

Remove blockers from the 2.2.0 release.

**Additional Notes:**

This reverts commit 7732142, reversing
changes made to ca006e9.

**How to test the change?**

Validate that CI is still green.
ivoanjo added a commit that referenced this pull request Jul 10, 2024
[PROF-3760] Revert "Merge pull request #3760 from DataDog/ivoanjo/prof-10123-placeholder-missing-allocations"
p-datadog pushed a commit to p-datadog/dd-trace-rb that referenced this pull request Jul 16, 2024
* master: (98 commits)
  DEBUG-2334 Remove old ddtrace require from tracing benchmarks (DataDog#3783)
  Avoid environment starts with number
  Add more variables for artifacts unpack
  `git fetch` before rev-parse
  Use ruby image that has `git` installed
  Provide versions
  Fixed: RSpec #log_deprecation matcher error when log is limited.
  Remove internal entry from changelog
  Update lockfiles for release 2.2.0
  Bump version 2.1.0 to 2.2.0
  Add 2.2.0 to CHANGELOG.md
  Add VSCode extension recommendations
  Add `ruby-lsp` gem
  [PROF-3760] Revert "Merge pull request DataDog#3760 from DataDog/ivoanjo/prof-10123-placeholder-missing-allocations"
  [PROF-10112] Also simplify actionview templates with three underscores
  remove duplicated comments
  Use ActiveSupport events to instrument ActiveSupport::Cache
  [PROF-10125] Track unscaled allocation counts in allocation profiler
  Bootstrap new profile value type for unscaled counts
  apply static analysis proposed fixes
  ...
@ivoanjo ivoanjo restored the ivoanjo/prof-10123-placeholder-missing-allocations branch July 18, 2024 08:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
profiling Involves Datadog profiling
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants