Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PROF-10123] Add placeholder for skipped samples in allocation profiler, second try #3792

Merged

Commits on Jul 18, 2024

  1. [PROF-10123] Add placeholder for skipped samples in allocation profiler

    **What does this PR do?**
    
    This PR fixes a corner case issue with the allocation profiler: if too
    many objects get allocated, we "clamp" the weight of every individual
    sample (to avoid introducing bias) to a maximum value.
    
    But we previously did not in any way represent the clamping on the
    profile itself, which meant that the total number of allocations
    reported by the profiler would be missing any object counts that
    were clamped.
    
    In practice, we don't expect this to happen very often, as we have
    a high clamping value `MAX_ALLOC_WEIGHT = 65535` (it's in
    `collectors_cpu_and_wall_time_worker.c`).
    
    This PR then introduces an extra placeholder sample, when needed,
    to represent allocations that went over the clamped value.
    
    Thus the total amount of objects allocated reported by the profiler
    now becomes accurate, even in the presence of clamping.
    
    **Motivation:**
    
    We're considering lowering the maximum clamping value, which would mean
    it would be hit more often, and thus it's important to fix this issue
    before we do that.
    
    **Additional Notes:**
    
    N/A
    
    **How to test the change?**
    
    I've added test coverage for this change. I also used the following
    snippet to quickly experiment while I was developing this PR:
    
    ```ruby
    Datadog::Profiling.wait_until_running
    allocator = Thread.new { loop { BasicObject.new } }
    sleep 10
    allocator.kill; allocator.join
    puts "Allocated #{GC.stat(:total_allocated_objects)} objects total"
    ```
    ivoanjo committed Jul 18, 2024
    Configuration menu
    Copy the full SHA
    3d39c47 View commit details
    Browse the repository at this point in the history
  2. Avoid deadlocking in profiler when end_heap_allocation_recording ra…

    …ises
    
    Due to a bug in the integration between how the heap sampler works and
    the new "missing allocations/skipped samples" placeholder, we could
    cause one of the heap sampler sanity checks to fail, and thus raise
    an exception.
    
    This exception would be correctly logged, e.g. something like
    > W, [2024-07-08T11:24:21.212654 #226523]  WARN -- datadog: [datadog]
    > (lib/datadog/profiling/collectors/cpu_and_wall_time_worker.rb:72:in
    > `rescue in block (2 levels) in start') CpuAndWallTimeWorker thread
    > error. Cause: RuntimeError Ended a heap recording that was not
    > started Location: missing-testcase.rb:2:in `new'
    
    The exception would also cause the profiler to try to shut down.
    
    But this is where we would run into issues: the scheduler still tries
    to report one last profile before shutting down, and because the
    above exception left the stack recorder with the active slot mutex
    locked, this would make the scheduler deadlock forever, thus
    preventing the Ruby VM from shutting down.
    
    Part 1 on fixing this issue is to correctly unlock the mutex, thus
    getting rid of the deadlock.
    
    Part 2 (next commit) will be to not report a profile when the profiler
    stopped with a failure. This seems like a prudent way to go, since the
    profile may be in an inconsistent state.
    
    (Out of curiosity, both 1 or 2 would suffice to avoid the issue, but
    I think both seem useful individually to avoid future issues.)
    ivoanjo committed Jul 18, 2024
    Configuration menu
    Copy the full SHA
    cb03ed1 View commit details
    Browse the repository at this point in the history
  3. Do not try to flush profile when profiler failed

    When the worker component fails, we also stop the scheduler. But, up
    until now, if there was enough data to be reported, we would still try
    to serialize and report a last profile.
    
    This is not an entirely great idea, as the profile may not be in a
    consistent state. So, instead, this commit adds a flag to mark this
    situation and the scheduler will use it to skip any attempts to
    report data.
    ivoanjo committed Jul 18, 2024
    Configuration menu
    Copy the full SHA
    ca880f3 View commit details
    Browse the repository at this point in the history
  4. Disable heap profiling spec on legacy rubies

    Oooops! This spec definitely should not be running on legacy Rubies.
    
    It actually started to flake with a recent change, and it took me
    a while to actually remember that it shouldn't even be running in
    the first place.
    ivoanjo committed Jul 18, 2024
    Configuration menu
    Copy the full SHA
    22e6ae8 View commit details
    Browse the repository at this point in the history
  5. Fix placeholder for skipped samples in allocation profiler breaking h…

    …eap profiler, try 2
    
    Adding the "skipped samples" placeholder breaks the heap profiler, as
    it starts failing with:
    
    > WARN -- datadog: [datadog] CpuAndWallTimeWorker thread error. Cause:
    > RuntimeError Ended a heap recording that was not started
    > WARN -- datadog: [datadog] Detected issue with profiler (worker
    > component), stopping profiling.
    
    This is because the stack recorder up until now used the check for
    `alloc_samples > 0` as a proxy for "this is an allocation sample, and
    someone already called `track_object(...)` with it".
    
    The addition of the skipped samples placeholder broke this assumption,
    as it uses `alloc_samples > 0` but explicitly there's no object being
    tracked.
    
    To fix this issue, I've added a new argument to the `sample_values`
    that explicitly marks a sample as also being heap (or not). Thus,
    we can now explicitly tell the stack recorder which situation we are in
    using that flag, instead of trying to derive it from `alloc_samples >
    0`.
    
    This is actually the second attempt at fixing this. The first attempt
    got reverted along with
    #3775 because it tried to
    detect placeholders (and thus used a placeholder as a proxy for
    "someone called `track_object(...)`". This was incorrect because
    there's other placeholders than just the "skipped samples" (case in
    point: the "In native code" placeholder), which can happen together
    with an object allocation.
    ivoanjo committed Jul 18, 2024
    Configuration menu
    Copy the full SHA
    5c7bd5f View commit details
    Browse the repository at this point in the history
  6. Minor: Simplify time retrieval during worker initialization

    By getting the timestamp before allocating any memory, we can safely
    rely on the exception-raising behavior of our helper, rather than
    needing to handle clearing the state if we get a failure
    mid-initialization.
    
    Having a timestamp that's a few ns before is not expected to impact the
    discrete dynamic sampler behavior.
    ivoanjo committed Jul 18, 2024
    Configuration menu
    Copy the full SHA
    258e02b View commit details
    Browse the repository at this point in the history

Commits on Jul 24, 2024

  1. Adjust thread id for skipped samples

    At some point "SA" stood for "Skipped Allocations", but after a few
    renames elsewhere this is kinda confusing so let's make it "SS" for
    "Skipped Samples".
    ivoanjo committed Jul 24, 2024
    Configuration menu
    Copy the full SHA
    16005f4 View commit details
    Browse the repository at this point in the history