[PROF-10201] Reduce allocation profiling overhead by using coarse timestamps #3797
This will be used by the discrete dynamic sampler to make cheaper checks of when to re-evaluate whether it should do work again.
**What does this PR do?**

This PR reduces the allocation profiling overhead (or: optimizes allocation profiling :D ) by using coarse timestamps on the `on_newobj_event` hot path.

**Motivation:**

When allocation profiling is enabled, the profiler gets called for almost every object allocated in the Ruby VM. Thus, this code path is extremely sensitive: the less work we do before returning control to Ruby, the less impact allocation profiling has on the application.

The dynamic sampling rate mechanism we employ takes the current timestamp as an input to decide if "enough time" has elapsed since it last readjusted itself. But "enough time" right now is *one second*, so we can get away with using `CLOCK_MONOTONIC_COARSE` on Linux, which is noticeably cheaper than the regular `CLOCK_MONOTONIC`.

**Additional Notes:**

Enabling the use of the coarse monotonic clock _sometimes_ on the discrete dynamic sampler required it to "spill" some of its guts out to the caller, so that the caller could correctly use the coarse clock on the hot path.
**How to test the change?**

Here's my experiment to compare the three different clock sources I evaluated:

```c++
// Build with
// `g++ time_sources.cpp -o time_sources -lbenchmark -lpthread`
// where benchmark is <https://github.com/google/benchmark> aka
// `apt install libbenchmark1 libbenchmark-dev` on ubuntu/debian
#include <benchmark/benchmark.h>
#include <x86intrin.h> // For __rdtsc
#include <ctime>       // For clock_gettime

static void BM_RDTSC(benchmark::State& state) {
  for (auto _ : state) {
    benchmark::DoNotOptimize(__rdtsc());
  }
}

static void BM_ClockMonotonic(benchmark::State& state) {
  timespec ts;
  for (auto _ : state) {
    clock_gettime(CLOCK_MONOTONIC, &ts);
    benchmark::DoNotOptimize(ts);
  }
}

static void BM_ClockMonotonicCoarse(benchmark::State& state) {
  timespec ts;
  for (auto _ : state) {
    clock_gettime(CLOCK_MONOTONIC_COARSE, &ts);
    benchmark::DoNotOptimize(ts);
  }
}

BENCHMARK(BM_RDTSC);
BENCHMARK(BM_ClockMonotonic);
BENCHMARK(BM_ClockMonotonicCoarse);

BENCHMARK_MAIN();
```

Results on my machine:

```
./time_sources --benchmark_repetitions=10 --benchmark_report_aggregates_only=true
2024-07-19T10:48:20+01:00
Running ./time_sources
Run on (20 X 4900 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x10)
  L1 Instruction 32 KiB (x10)
  L2 Unified 1280 KiB (x10)
  L3 Unified 24576 KiB (x1)
Load Average: 1.23, 1.30, 1.11
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may
be noisy and will incur extra overhead.
-------------------------------------------------------------------------
Benchmark                            Time             CPU   Iterations
-------------------------------------------------------------------------
BM_RDTSC_mean                     5.52 ns         5.52 ns           10
BM_RDTSC_median                   5.44 ns         5.44 ns           10
BM_RDTSC_stddev                  0.148 ns        0.147 ns           10
BM_RDTSC_cv                       2.67 %          2.67 %            10
BM_ClockMonotonic_mean            15.8 ns         15.8 ns           10
BM_ClockMonotonic_median          15.4 ns         15.4 ns           10
BM_ClockMonotonic_stddev          1.07 ns         1.07 ns           10
BM_ClockMonotonic_cv              6.77 %          6.77 %            10
BM_ClockMonotonicCoarse_mean      5.92 ns         5.92 ns           10
BM_ClockMonotonicCoarse_median    5.93 ns         5.93 ns           10
BM_ClockMonotonicCoarse_stddev   0.041 ns        0.041 ns           10
BM_ClockMonotonicCoarse_cv        0.68 %          0.68 %            10
```

and here's the result of running `benchmarks/profiler_allocation.rb` comparing master to this branch:

```
ruby 2.7.7p221 (2022-11-24 revision 168ec2b1e5) [x86_64-linux]
Warming up --------------------------------------
Allocations (baseline)     1.431M i/100ms
Calculating -------------------------------------
Allocations (baseline)    14.370M (± 2.0%) i/s -    144.541M in  10.062635s
Warming up --------------------------------------
Allocations (master)       1.014M i/100ms
Calculating -------------------------------------
Allocations (master)      10.165M (± 1.0%) i/s -    102.390M in  10.074151s
Warming up --------------------------------------
Allocations (coarse)       1.179M i/100ms
Calculating -------------------------------------
Allocations (coarse)      11.495M (± 2.5%) i/s -    115.573M in  10.059971s

Comparison:
Allocations (baseline): 14369960.1 i/s
Allocations (coarse):   11495418.2 i/s - 1.25x slower
Allocations (master):   10164615.7 i/s - 1.41x slower
```

I've specifically used Ruby 2.7 for this comparison since this benchmark had a lot more variance (including the baseline) on later Rubies, and I wanted to isolate the specific changes to this code path.
**Benchmarks**

Benchmark execution time: 2024-07-22 10:57:07

Comparing candidate commit 446f2ad in PR branch. Found 0 performance improvements and 0 performance regressions! Performance is the same for 10 metrics, 2 unstable metrics.
**Codecov Report**

All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@ Coverage Diff @@
##           master    #3797   +/-   ##
=======================================
  Coverage   97.91%   97.91%
=======================================
  Files        1248     1248
  Lines       75192    75192
  Branches     3638     3638
=======================================
  Hits        73627    73627
  Misses       1565     1565
```

☔ View full report in Codecov by Sentry.
**What does this PR do?**

This PR introduces a new setting for the profiler: `profiling.advanced.allocation_counting_enabled`, which controls the profiler allocation counting feature. This setting is off by default, enabling us to reduce allocation profiling overhead slightly.

**Motivation:**

We actually used to have this setting back in the ddtrace 1.x series. We introduced it in #2635 and removed it in #3281 -- by tying it directly to allocation profiling. I decided to re-introduce it so we can disable this feature by default.

**Additional Notes:**

This PR sits atop #3797 because it makes sense to measure both overhead improvements together, but is otherwise completely independent.

**How to test the change?**

Here's the results for running the `profiler_allocation` benchmark:

```
ruby 2.7.7p221 (2022-11-24 revision 168ec2b1e5) [x86_64-linux]
Warming up --------------------------------------
Allocations (baseline)                  1.419M i/100ms
Calculating -------------------------------------
Allocations (baseline)                 14.535M (± 2.1%) i/s -    146.197M in  10.062717s
Warming up --------------------------------------
Allocations (alloc_profiling_enabled)   1.122M i/100ms
Calculating -------------------------------------
Allocations (alloc_profiling_enabled)  11.636M (± 2.2%) i/s -    116.679M in  10.032209s
Warming up --------------------------------------
Allocations (alloc_profiling_disabled)  1.124M i/100ms
Calculating -------------------------------------
Allocations (alloc_profiling_disabled) 11.866M (± 2.6%) i/s -    119.175M in  10.050580s

Comparison:
Allocations (baseline):               14534979.3 i/s
Allocations (alloc_profiling_disabled): 11865926.7 i/s - 1.22x slower
Allocations (alloc_profiling_enabled):  11635919.9 i/s - 1.25x slower
```

The difference is close to the margin of error; nevertheless this feature was showing up on the native profiler, and since it was on the hot path for allocation profiling, I think it's worth it.
🏎️ 🔥
Thanks for the review! :)