[PROF-10123] Add placeholder for skipped samples in allocation profiler #3760
Conversation
…ofiler

**What does this PR do?**

This PR fixes a corner-case issue with the allocation profiler: if too many objects get allocated, we "clamp" the weight of every individual sample to a maximum value (to avoid introducing bias). But we previously did not represent the clamping on the profile itself in any way, which meant that the total number of allocations reported by the profiler would be missing any object counts that were clamped.

In practice, we don't expect this to happen very often, as we have a high clamping value `MAX_ALLOC_WEIGHT = 65535` (it's in `collectors_cpu_and_wall_time_worker.c`).

This PR introduces an extra placeholder sample, when needed, to represent allocations that went over the clamping value. Thus the total number of objects allocated reported by the profiler is now accurate, even in the presence of clamping.

**Motivation:**

We're considering lowering the maximum clamping value, which would mean it would be hit more often, so it's important to fix this issue before we do that.

**Additional Notes:**

N/A

**How to test the change?**

I've added test coverage for this change. I also used the following snippet to quickly experiment while I was developing this PR:

```ruby
Datadog::Profiling.wait_until_running
allocator = Thread.new { loop { BasicObject.new } }
sleep 10
allocator.kill; allocator.join
puts "Allocated #{GC.stat(:total_allocated_objects)} objects total"
```
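To make the bookkeeping concrete, here's a minimal standalone sketch of the idea, not the profiler's actual C implementation: per-sample weights get clamped to `MAX_ALLOC_WEIGHT`, the clamped-away remainder is accumulated, and one placeholder sample is emitted at serialization time so the totals add up. All names besides `MAX_ALLOC_WEIGHT` are made up for illustration.

```ruby
MAX_ALLOC_WEIGHT = 65_535  # clamping value from the PR description

samples = []               # [stack, weight] pairs actually recorded
skipped_weight = 0         # allocation counts dropped by clamping

# Record an allocation sample, clamping its weight and remembering
# how much weight was clamped away.
record = lambda do |stack, weight|
  clamped = [weight, MAX_ALLOC_WEIGHT].min
  samples << [stack, clamped]
  skipped_weight += weight - clamped
end

record.call("app.rb:1", 100_000)  # over the limit: only 65_535 recorded
record.call("app.rb:2", 1_000)    # under the limit: recorded unchanged

# When serializing, emit one placeholder sample covering the clamped
# weight, so the reported total matches the true allocation count.
samples << ["Skipped Samples", skipped_weight] if skipped_weight > 0

total = samples.sum { |(_, weight)| weight }
puts total  # => 101_000, the true allocation count
```

Without the placeholder, the reported total would be 66,535 — missing the 34,465 allocations lost to clamping.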
The "missing" allocations naming is confusing to me. What do you think about "skipped" instead?
Thanks for the suggestion! I've replaced "Missing Allocations" with "Skipped Samples".
…name

This was a great suggestion from the PR review.
We've seen this in a few other places in the codebase, so let's use the usual workaround.
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@ Coverage Diff @@
## master #3760 +/- ##
========================================
Coverage 97.91% 97.91%
========================================
Files 1241 1241
Lines 74632 74752 +120
Branches 3605 3607 +2
========================================
+ Hits 73074 73196 +122
+ Misses 1558 1556 -2

View full report in Codecov by Sentry.
…ises

Due to a bug in the integration between how the heap sampler works and the new "missing allocations/skipped samples" placeholder, we could cause one of the heap sampler sanity checks to fail, and thus raise an exception. This exception would be correctly logged, e.g. something like:

> W, [2024-07-08T11:24:21.212654 #226523] WARN -- datadog: [datadog]
> (lib/datadog/profiling/collectors/cpu_and_wall_time_worker.rb:72:in
> `rescue in block (2 levels) in start') CpuAndWallTimeWorker thread
> error. Cause: RuntimeError Ended a heap recording that was not
> started Location: missing-testcase.rb:2:in `new'

The exception would also cause the profiler to try to shut down. But this is where we would run into issues: the scheduler still tries to report one last profile before shutting down, and because the above exception left the stack recorder with the active slot mutex locked, the scheduler would deadlock forever, thus preventing the Ruby VM from shutting down.

Part 1 of fixing this issue is to correctly unlock the mutex, thus getting rid of the deadlock. Part 2 (next commit) will be to not report a profile when the profiler stopped with a failure. This seems like a prudent way to go, since the profile may be in an inconsistent state.

(Out of curiosity: either 1 or 2 alone would suffice to avoid the issue, but both seem useful individually to avoid future issues.)
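The unlock fix can be sketched in plain Ruby (the real change is in the C stack recorder; `active_slot_mutex` and the lambda here are hypothetical stand-ins): releasing the lock in an `ensure` block guarantees it is released even when a sanity check raises, so a later serialization attempt cannot deadlock.

```ruby
active_slot_mutex = Mutex.new

# Sketch of the problematic path: before the fix, a raise between lock
# and unlock left the mutex held forever. With ensure, the mutex is
# released even when the sanity check fails.
sample_allocation = lambda do |fail_sanity_check:|
  active_slot_mutex.lock
  begin
    raise "Ended a heap recording that was not started" if fail_sanity_check
  ensure
    active_slot_mutex.unlock  # Part 1 of the fix: always unlock
  end
end

begin
  sample_allocation.call(fail_sanity_check: true)
rescue RuntimeError => e
  puts "worker error: #{e.message}"
end

# The scheduler's final report can now acquire the lock instead of
# deadlocking on shutdown.
active_slot_mutex.synchronize { puts "serialized last profile" }
```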
When the worker component fails, we also stop the scheduler. But, up until now, if there was enough data to be reported, we would still try to serialize and report a last profile. This is not an entirely great idea, as the profile may not be in a consistent state. So, instead, this commit adds a flag to mark this situation and the scheduler will use it to skip any attempts to report data.
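A sketch of that flag in plain Ruby — the class and method names (`Scheduler`, `mark_profiler_failed`, `flush_on_shutdown`) are illustrative, not the library's actual API:

```ruby
# Hypothetical scheduler that skips its final report after a failure.
class Scheduler
  def initialize
    @failure_detected = false
  end

  # Called when the worker component detects an unrecoverable error.
  def mark_profiler_failed
    @failure_detected = true
  end

  # Final flush on shutdown: skip reporting when the profile may be
  # in an inconsistent state.
  def flush_on_shutdown
    @failure_detected ? :skipped : :reported
  end
end

healthy_scheduler = Scheduler.new
failed_scheduler = Scheduler.new
failed_scheduler.mark_profiler_failed

puts healthy_scheduler.flush_on_shutdown  # => reported
puts failed_scheduler.flush_on_shutdown   # => skipped
```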
Update: I've discovered this PR creates an incompatibility with the heap profiler. We didn't have tests for this interaction, so I'm working to fix the bug + add tests to make sure we have coverage for it.
…eap profiler

I missed it in my original commits, but the placeholder for skipped samples broke the heap profiler, which started failing with:

> W, [2024-07-08T15:41:27.128532 #366230] WARN -- datadog: [datadog]
> CpuAndWallTimeWorker thread error. Cause: RuntimeError Ended a heap
> recording that was not started Location:
> missing-testcase.rb:2:in `new'
> W, [2024-07-08T15:41:27.128583 #366230] WARN -- datadog: [datadog]
> Detected issue with profiler (worker component), stopping profiling.
> See previous log messages for details.

This is triggered by how our current heap profiler integrates with the rest of the profiler: it expects to be called twice for each allocation sample -- once before the stack is collected, and once after. The implementation of the placeholder for skipped samples correctly did not call the heap profiler before the placeholder stack got collected, BUT we were still calling the heap profiler after. This caused its internal consistency checks to fail.

The fix is to not call the heap profiler after the placeholder sample. I chose to add a new argument to `record_sample`, which is not particularly pretty, but until we come up with a better way to have the heap profiler integrate with the rest of the components, it'll do.
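The two-phase contract and the failure mode can be sketched in Ruby — `HeapRecorder` and the `heap_sample:` keyword are illustrative stand-ins for the C-level `record_sample` change, not the actual implementation:

```ruby
# Hypothetical heap recorder enforcing a start/end pairing sanity check,
# mirroring the "Ended a heap recording that was not started" error.
class HeapRecorder
  def initialize
    @recording = false
  end

  def start_recording
    raise "Started a heap recording that was not ended" if @recording
    @recording = true
  end

  def end_recording
    raise "Ended a heap recording that was not started" unless @recording
    @recording = false
  end
end

heap = HeapRecorder.new

# The fix: placeholder samples skip BOTH heap profiler phases, keeping
# start/end calls balanced.
record_sample = lambda do |heap_sample:|
  heap.start_recording if heap_sample  # phase 1: before stack collection
  # ... collect the stack ...
  heap.end_recording if heap_sample    # phase 2: after stack collection
end

record_sample.call(heap_sample: true)   # regular allocation sample: ok
record_sample.call(heap_sample: false)  # placeholder sample: both phases skipped
```

The bug was equivalent to calling only phase 2 for a placeholder sample, which trips the `end_recording` check.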
Oooops! This spec definitely should not be running on legacy Rubies. It actually started to flake with a recent change, and it took me a while to actually remember that it shouldn't even be running in the first place.
I saw some flakiness in CI, so this is a very soft attempt at avoiding it; let's see if it's enough or we'll need more changes.
…f-10123-placeholder-missing-allocations"

**What does this PR do?**

This PR reverts the change from #3760. While we still want to add this feature, the implementation in #3760 can cause the heap profiler to fail with an exception. This failure also uncovered a few other potential bugs in our error handling code. Thus, to be on the safe side, and because we want to get a new release out, I'm reverting the whole PR until I can re-open a fixed version.

**Motivation:**

Remove blockers from the 2.2.0 release.

**Additional Notes:**

This reverts commit 7732142, reversing changes made to ca006e9.

**How to test the change?**

Validate that CI is still green.
[PROF-3760] Revert "Merge pull request #3760 from DataDog/ivoanjo/prof-10123-placeholder-missing-allocations"
* master: (98 commits)
  - DEBUG-2334 Remove old ddtrace require from tracing benchmarks (DataDog#3783)
  - Avoid environment starts with number
  - Add more variables for artifacts unpack
  - `git fetch` before rev-parse
  - Use ruby image that has `git` installed
  - Provide versions
  - Fixed: RSpec #log_deprecation matcher error when log is limited.
  - Remove internal entry from changelog
  - Update lockfiles for release 2.2.0
  - Bump version 2.1.0 to 2.2.0
  - Add 2.2.0 to CHANGELOG.md
  - Add VSCode extension recommendations
  - Add `ruby-lsp` gem
  - [PROF-3760] Revert "Merge pull request DataDog#3760 from DataDog/ivoanjo/prof-10123-placeholder-missing-allocations"
  - [PROF-10112] Also simplify actionview templates with three underscores
  - remove duplicated comments
  - Use ActiveSupport events to instrument ActiveSupport::Cache
  - [PROF-10125] Track unscaled allocation counts in allocation profiler
  - Bootstrap new profile value type for unscaled counts
  - apply static analysis proposed fixes
  - ...
What does this PR do?
This PR fixes a corner case issue with the allocation profiler: if too many objects get allocated, we "clamp" the weight of every individual sample (to avoid introducing bias) to a maximum value.
But we previously did not in any way represent the clamping on the profile itself, which meant that the total number of allocations reported by the profiler would be missing any object counts that were clamped.
In practice, we don't expect this to happen very often, as we have a high clamping value `MAX_ALLOC_WEIGHT = 65535` (it's in `collectors_cpu_and_wall_time_worker.c`).

This PR then introduces an extra placeholder sample, when needed, to represent allocations that went over the clamping value.
Thus the total number of objects allocated reported by the profiler is now accurate, even in the presence of clamping.
Motivation:
We're considering lowering the maximum clamping value, which would mean it would be hit more often, so it's important to fix this issue before we do that.
Additional Notes:
N/A
How to test the change?
I've added test coverage for this change. I also used the following snippet to quickly experiment while I was developing this PR:
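The snippet itself didn't survive the page extraction at this point, so here is a standalone variant that runs without the profiler (no `Datadog::Profiling` dependency, and a bounded allocation loop instead of `sleep 10`), exercising the same VM-wide allocation counter:

```ruby
# Standalone variant of the test snippet: allocate a known number of
# objects on a background thread and read Ruby's VM-wide counter.
before = GC.stat(:total_allocated_objects)

allocator = Thread.new { 100_000.times { BasicObject.new } }
allocator.join

allocated = GC.stat(:total_allocated_objects) - before
puts "Allocated #{allocated} objects total"
```

Note that `GC.stat(:total_allocated_objects)` counts every allocation in the VM, which is exactly why clamped (unrepresented) samples made the profiler's total visibly disagree with it.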
Here's how it looks (it's the orange frame on the left):

[screenshot: flamegraph showing the "Skipped Samples" placeholder frame]

You can observe that in this testcase, with a massive number of allocations, we would be missing around 10% of allocation counts, which we now track correctly.