
[NO-TICKET] Raise benchmark default durations #3869

Closed
ivoanjo wants to merge 4 commits into master from ivoanjo/raise-benchmark-durations

Conversation


@ivoanjo ivoanjo commented Aug 27, 2024

What does this PR do?

This PR goes through our existing benchmarks, and for those that used a default duration of 10 or 12 seconds, raises the duration to 30s.
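
For illustration, here's a minimal sketch of the kind of change, assuming a typical benchmark-ips setup (the benchmark name and exact config are illustrative, not lifted from the actual files):

```ruby
require 'benchmark/ips'

Benchmark.ips do |x|
  # Previously time: 10 (or 12); raised so that benchmarks with slow
  # iterations accumulate enough samples for a stable measurement.
  x.config(time: 30, warmup: 2)

  x.report('profiler - sample serialize') do
    # ... work being benchmarked ...
  end

  x.compare!
end
```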

Motivation:

We've observed most of these benchmarks having flaky regressions/improvements on PRs that shouldn't affect them.

I suspect that running the benchmarks for such a short time may be contributing to the flakiness. For instance, one of the benchmarks that flakes most often is profiler_sample_serialize.rb, and coincidentally that's a benchmark where each iteration does a lot of work, so in a typical 10 second run we may only see around 70 iterations (roughly 140ms per iteration) -- too few samples for a stable result.

Hopefully by running the benchmarks for slightly longer we'll have more consistent results.

Additional Notes:

There is a downside to this -- because our benchmarks are currently executed sequentially in CI, this will make the benchmark run take quite a bit longer than it used to.

Hopefully this trade-off is reasonable; if not, we can re-evaluate.

How to test the change?

Check the latest benchmarks CI run, and confirm the tests are running for 30 seconds, rather than 10/12.

@ivoanjo ivoanjo requested a review from a team as a code owner August 27, 2024 16:53
@ivoanjo ivoanjo added the dev/ci (Involves CircleCI, GitHub Actions, or GitLab) and dev/internal (Other internal work that does not need to be included in the changelog) labels Aug 27, 2024
@github-actions github-actions bot added the dev/testing (Involves testing processes, e.g. RSpec) label and removed the dev/ci label Aug 27, 2024

codecov-commenter commented Aug 27, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.85%. Comparing base (3679939) to head (ee157a1).
Report is 10 commits behind head on master.

Additional details and impacted files
```
@@            Coverage Diff             @@
##           master    #3869      +/-   ##
==========================================
- Coverage   97.85%   97.85%   -0.01%
==========================================
  Files        1271     1277       +6
  Lines       76024    76304     +280
  Branches     3740     3740
==========================================
+ Hits        74395    74667     +272
- Misses       1629     1637       +8
```



pr-commenter bot commented Aug 27, 2024

Benchmarks

Benchmark execution time: 2024-08-28 17:05:59

Comparing candidate commit ee157a1 in PR branch ivoanjo/raise-benchmark-durations with baseline commit 0c87de3 in branch master.

Found 2 performance improvements and 2 performance regressions! Performance is the same for 21 metrics, 0 unstable metrics.

scenario:profiler - Major GC runs (profiling disabled)

  • 🟩 throughput [+2.789op/s; +2.906op/s] or [+2.395%; +2.495%]

scenario:profiler - Major GC runs (profiling enabled)

  • 🟩 throughput [+2.636op/s; +2.749op/s] or [+2.389%; +2.491%]

scenario:profiler - sample timeline=false

  • 🟥 throughput [-0.452op/s; -0.439op/s] or [-6.267%; -6.084%]

scenario:tracing - Propagation - Datadog

  • 🟥 throughput [-737.799op/s; -696.846op/s] or [-2.217%; -2.094%]

Seems to still be unstable in a benchmarking run:

```
ruby 3.3.0 (2023-12-25 revision 5124f9ac75) [x86_64-linux]
Warming up --------------------------------------
profiler - hold / resume
                       203.634k i/100ms
Calculating -------------------------------------
profiler - hold / resume
                          2.037M (± 0.2%) i/s -     61.294M in  30.096368s
```

vs

```
ruby 3.3.0 (2023-12-25 revision 5124f9ac75) [x86_64-linux]
Warming up --------------------------------------
profiler - hold / resume
                       210.917k i/100ms
Calculating -------------------------------------
profiler - hold / resume
                          2.099M (± 0.2%) i/s -     63.064M in  30.040197s
```
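
For reference, the two runs above differ by roughly 3% in throughput (2.099M / 2.037M ≈ 1.03), which is well outside the ±0.2% deviation each individual run reports for itself -- that between-run gap is the instability in question.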

Let's see if, as a last-ditch attempt, raising the duration a bit
more fixes it, or if we need to look into other options.
@ivoanjo ivoanjo force-pushed the ivoanjo/raise-benchmark-durations branch from ec4fe81 to 01b37f8 on August 28, 2024 08:16
Still got an unstable result on the last run; let's see if 60 seconds
helps here or not...

@p-datadog p-datadog left a comment


I support the changes if they make the benchmarks less flaky, but the runtime is now 45 minutes for this configuration? Should we consider splitting it?


ivoanjo commented Aug 28, 2024

I support the changes if they make the benchmarks less flaky, but the runtime is now 45 minutes for this configuration? Should we consider splitting it?

Good point. To be honest, I'm still seeing too much variation in the benchmarks, so as a last desperate measure I'll try raising every one to 1 minute.

If one minute is reasonable (I'm starting to suspect it's not going to be), I think it's worth looking into splitting the run...
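
For illustration, one way the split could look -- a hypothetical helper that partitions the benchmark files across parallel CI jobs (CI_NODE_INDEX, CI_NODE_TOTAL, and the glob are made-up placeholders, not our actual CI setup):

```ruby
require 'rbconfig'

# Hypothetical sketch: partition benchmark files across parallel CI jobs.
files = Dir.glob('benchmarks/*.rb').sort
node_index = Integer(ENV.fetch('CI_NODE_INDEX', '0'))
node_total = Integer(ENV.fetch('CI_NODE_TOTAL', '1'))

# Split the sorted file list into node_total roughly-equal groups and run
# only the group assigned to this node.
group_size = [(files.size.to_f / node_total).ceil, 1].max
my_files = files.each_slice(group_size).to_a.fetch(node_index, [])

my_files.each do |file|
  system(RbConfig.ruby, file) or abort("benchmark #{file} failed")
end
```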

I suspect this won't be enough to make the benchmarks consistent, but
let's try and see how it goes.

ivoanjo commented Aug 29, 2024

Ok, it seems like even with a 60-second duration, we're still seeing bogus improvements/regressions being reported. Let me move this PR to draft for now while I follow up on other solutions.

@ivoanjo ivoanjo marked this pull request as draft August 29, 2024 08:15

ivoanjo commented Sep 19, 2024

We ended up tweaking the thresholds on the benchmarking platform and it reduced the variance a lot. There are still a few tests that seem to show wider variance -- I'll open separate PRs to tweak their durations as needed.

Closing this one for now.

@ivoanjo ivoanjo closed this Sep 19, 2024