[PROF-10201] Reduce allocation profiling overhead by using coarse timestamps #3797
This will be used by the discrete dynamic sampler to make cheaper checks of when to re-evaluate whether it should do work again.
**What does this PR do?**

This PR reduces the allocation profiling overhead (or: optimizes allocation profiling :D ) by using coarse timestamps on the `on_newobj_event` hot path.

**Motivation:**

When allocation profiling is enabled, the profiler gets called for almost every object allocated in the Ruby VM. Thus, this code path is extremely sensitive: the less work we do before returning control to Ruby, the less impact allocation profiling has on the application.

The dynamic sampling rate mechanism we employ takes the current timestamp as an input to decide if "enough time" has elapsed since it last readjusted itself. But "enough time" right now is *one second*, so we can get away with using `CLOCK_MONOTONIC_COARSE` on Linux, which is noticeably cheaper than the regular `CLOCK_MONOTONIC`.

**Additional Notes:**

Enabling the use of the coarse monotonic clock _sometimes_ on the discrete dynamic sampler required it to "spill" some of its guts out to the caller, so that the caller could correctly use the coarse clock on the hot path.
**How to test the change?**

Here's my experiment to compare the three different clock sources I evaluated:

```c++
// Build with
// `g++ time_sources.cpp -o time_sources -lbenchmark -lpthread`
// where benchmark is <https://github.com/google/benchmark> aka
// `apt install libbenchmark1 libbenchmark-dev` on ubuntu/debian
#include <benchmark/benchmark.h>
#include <x86intrin.h> // For __rdtsc
#include <ctime>       // For clock_gettime

static void BM_RDTSC(benchmark::State& state) {
  for (auto _ : state) {
    benchmark::DoNotOptimize(__rdtsc());
  }
}

static void BM_ClockMonotonic(benchmark::State& state) {
  timespec ts;
  for (auto _ : state) {
    clock_gettime(CLOCK_MONOTONIC, &ts);
    benchmark::DoNotOptimize(ts);
  }
}

static void BM_ClockMonotonicCoarse(benchmark::State& state) {
  timespec ts;
  for (auto _ : state) {
    clock_gettime(CLOCK_MONOTONIC_COARSE, &ts);
    benchmark::DoNotOptimize(ts);
  }
}

BENCHMARK(BM_RDTSC);
BENCHMARK(BM_ClockMonotonic);
BENCHMARK(BM_ClockMonotonicCoarse);

BENCHMARK_MAIN();
```

Results on my machine:

```
./time_sources --benchmark_repetitions=10 --benchmark_report_aggregates_only=true
2024-07-19T10:48:20+01:00
Running ./time_sources
Run on (20 X 4900 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x10)
  L1 Instruction 32 KiB (x10)
  L2 Unified 1280 KiB (x10)
  L3 Unified 24576 KiB (x1)
Load Average: 1.23, 1.30, 1.11
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may
be noisy and will incur extra overhead.
-------------------------------------------------------------------------
Benchmark                            Time             CPU   Iterations
-------------------------------------------------------------------------
BM_RDTSC_mean                     5.52 ns         5.52 ns           10
BM_RDTSC_median                   5.44 ns         5.44 ns           10
BM_RDTSC_stddev                  0.148 ns        0.147 ns           10
BM_RDTSC_cv                       2.67 %          2.67 %            10
BM_ClockMonotonic_mean            15.8 ns         15.8 ns           10
BM_ClockMonotonic_median          15.4 ns         15.4 ns           10
BM_ClockMonotonic_stddev          1.07 ns         1.07 ns           10
BM_ClockMonotonic_cv              6.77 %          6.77 %            10
BM_ClockMonotonicCoarse_mean      5.92 ns         5.92 ns           10
BM_ClockMonotonicCoarse_median    5.93 ns         5.93 ns           10
BM_ClockMonotonicCoarse_stddev   0.041 ns        0.041 ns           10
BM_ClockMonotonicCoarse_cv        0.68 %          0.68 %            10
```

and here's the result of running `benchmarks/profiler_allocation.rb` comparing master to this branch:

```
ruby 2.7.7p221 (2022-11-24 revision 168ec2b1e5) [x86_64-linux]
Warming up --------------------------------------
Allocations (baseline)     1.431M i/100ms
Calculating -------------------------------------
Allocations (baseline)    14.370M (± 2.0%) i/s -    144.541M in  10.062635s
Warming up --------------------------------------
Allocations (master)       1.014M i/100ms
Calculating -------------------------------------
Allocations (master)      10.165M (± 1.0%) i/s -    102.390M in  10.074151s
Warming up --------------------------------------
Allocations (coarse)       1.179M i/100ms
Calculating -------------------------------------
Allocations (coarse)      11.495M (± 2.5%) i/s -    115.573M in  10.059971s

Comparison:
Allocations (baseline): 14369960.1 i/s
Allocations (coarse):   11495418.2 i/s - 1.25x slower
Allocations (master):   10164615.7 i/s - 1.41x slower
```

I've specifically used Ruby 2.7 for this comparison since this benchmark had a lot more variance (including the baseline) on later Rubies, and I wanted to isolate the specific changes to this code path.
**Benchmarks**

Benchmark execution time: 2024-07-22 10:57:07

Comparing candidate commit 446f2ad in PR branch. Found 0 performance improvements and 0 performance regressions! Performance is the same for 10 metrics, 2 unstable metrics.
**Codecov Report**

All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@ Coverage Diff @@
##           master    #3797   +/-   ##
=======================================
  Coverage   97.91%   97.91%
=======================================
  Files        1248     1248
  Lines       75192    75192
  Branches     3638     3638
=======================================
  Hits        73627    73627
  Misses       1565     1565
```

☔ View full report in Codecov by Sentry.
**What does this PR do?**

This PR introduces a new setting for the profiler: `profiling.advanced.allocation_counting_enabled`, which controls the profiler allocation counting feature. This setting is off by default, enabling us to reduce allocation profiling overhead slightly.

**Motivation:**

We actually used to have this setting back in the ddtrace 1.x series. We introduced it in #2635 and removed it in #3281 -- by tying it directly to allocation profiling. I decided to re-introduce it so we can disable this feature by default.

**Additional Notes:**

This PR sits atop #3797 because it makes sense to measure both overhead improvements together, but is otherwise completely independent.

**How to test the change?**

Here's the results for running the `profiler_allocation` benchmark:

```
ruby 2.7.7p221 (2022-11-24 revision 168ec2b1e5) [x86_64-linux]
Warming up --------------------------------------
Allocations (baseline)                  1.419M i/100ms
Calculating -------------------------------------
Allocations (baseline)                 14.535M (± 2.1%) i/s -    146.197M in  10.062717s
Warming up --------------------------------------
Allocations (alloc_profiling_enabled)   1.122M i/100ms
Calculating -------------------------------------
Allocations (alloc_profiling_enabled)  11.636M (± 2.2%) i/s -    116.679M in  10.032209s
Warming up --------------------------------------
Allocations (alloc_profiling_disabled)  1.124M i/100ms
Calculating -------------------------------------
Allocations (alloc_profiling_disabled) 11.866M (± 2.6%) i/s -    119.175M in  10.050580s

Comparison:
Allocations (baseline):               14534979.3 i/s
Allocations (alloc_profiling_disabled): 11865926.7 i/s - 1.22x slower
Allocations (alloc_profiling_enabled):  11635919.9 i/s - 1.25x slower
```

The difference is close to the margin of error; nevertheless this feature was showing up on the native profiler, and since it was on the hot path for allocation profiling, I think it's worth it.
🏎️ 🔥
Thanks for the review! :)