[NO-TICKET] Raise benchmark default durations #3869
Conversation
**Codecov Report:** All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@            Coverage Diff            @@
##           master    #3869     +/-  ##
=========================================
- Coverage   97.85%   97.85%   -0.01%
=========================================
  Files        1271     1277       +6
  Lines       76024    76304     +280
  Branches     3740     3740
=========================================
+ Hits        74395    74667     +272
- Misses       1629     1637       +8
```

☔ View full report in Codecov by Sentry.
**Benchmarks:** Benchmark execution time: 2024-08-28 17:05:59. Comparing candidate commit ee157a1 in PR branch. Found 2 performance improvements and 2 performance regressions! Performance is the same for 21 metrics, 0 unstable metrics.

Affected scenarios:
- profiler - Major GC runs (profiling disabled)
- profiler - Major GC runs (profiling enabled)
- profiler - sample timeline=false
- tracing - Propagation - Datadog
Seems to still be unstable in a benchmarking run:

```
ruby 3.3.0 (2023-12-25 revision 5124f9ac75) [x86_64-linux]
Warming up --------------------------------------
profiler - hold / resume   203.634k i/100ms
Calculating -------------------------------------
profiler - hold / resume     2.037M (± 0.2%) i/s -  61.294M in 30.096368s
```

vs

```
ruby 3.3.0 (2023-12-25 revision 5124f9ac75) [x86_64-linux]
Warming up --------------------------------------
profiler - hold / resume   210.917k i/100ms
Calculating -------------------------------------
profiler - hold / resume     2.099M (± 0.2%) i/s -  63.064M in 30.040197s
```

Let's see if, as a last-ditch attempt, raising the duration a bit more fixes it, or if we need to look into other options.
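For context, the gap between those two "hold / resume" runs is large relative to each run's reported noise: both report a ±0.2% standard deviation, yet the means differ by about 3%. A quick sanity check of that arithmetic (numbers taken from the output above):

```ruby
# Mean throughput (iterations/second) from the two benchmark runs above.
run_a = 2.037e6
run_b = 2.099e6

# Relative difference between the two runs, as a percentage.
rel_diff_pct = (run_b - run_a) / run_a * 100

puts format('%.1f%%', rel_diff_pct) # ~3.0% -- far beyond each run's ±0.2% in-run noise
```

A between-run gap an order of magnitude larger than the within-run deviation is what makes these results read as spurious regressions/improvements.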
Compare ec4fe81 to 01b37f8
Still got an unstable result on the last run; let's see if 60 seconds helps here or not...
I support the changes if they make the benchmarks less flaky, but the runtime is now 45 minutes for this configuration? Should we consider splitting it?
Good point. To be honest, I'm still seeing too much variation on the benchmarks, so as a last desperate measure I'll try raising every one to 1 minute. If one minute is reasonable (I'm starting to suspect it's not going to be), I think it's worth looking into the splitting...
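If splitting the benchmark run does become necessary, one simple approach is to shard the benchmark files across parallel CI jobs. This is a hypothetical sketch (the file names and the `JOB_INDEX` variable are illustrative, not the repository's actual setup):

```ruby
# Hypothetical sketch: partition benchmark files across N parallel CI jobs.
def shard(files, job_count)
  slice_size = (files.size.to_f / job_count).ceil
  files.each_slice(slice_size).to_a
end

# Illustrative file list -- not the repository's actual benchmarks.
files = %w[
  profiler_sample_serialize.rb
  profiler_hold_resume.rb
  profiler_gc.rb
  tracing_propagation.rb
]

shards = shard(files, 2)
# Each CI job would then run only its own shard, e.g.:
#   shards[ENV['JOB_INDEX'].to_i].each { |f| system('ruby', f) }
```

With two jobs, a 45-minute sequential run would drop to roughly half, at the cost of extra CI configuration.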
I suspect this won't be enough to make the benchmarks consistent, but let's try and see how it goes.
Ok, it seems like even with a 60-second duration, we're still seeing bogus improvements/regressions being reported. Let me move this PR to draft for now while I follow up on other solutions.
We ended up tweaking the thresholds on the benchmarking platform and it reduced the variance a lot. There are still a few tests that seem to show wider variance -- I'll open up separate PRs for tweaking their duration as needed. Closing this one for now.
What does this PR do?
This PR goes through our existing benchmarks, and for those that used a default duration of 10 or 12 seconds, raises the duration to 30s.
Motivation:
We've observed most of these benchmarks having flaky regressions/improvements on PRs that shouldn't affect them.
I suspect that running the benchmarks for such a short time may be contributing to the flakiness. For instance, one of the benchmarks that flakes most often is `profiler_sample_serialize.rb`, and coincidentally that's a benchmark where each iteration does a lot of work, and thus in a typical 10-second run we may only see around 70 iterations. Hopefully by running the benchmarks for slightly longer we'll have more consistent results.
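To make the sample-size argument concrete, here's the rough arithmetic (the ~70-iterations figure comes from the description above; everything else follows from it):

```ruby
# profiler_sample_serialize.rb sees roughly 70 iterations in a 10-second run,
# so each iteration takes about 0.14s.
seconds_per_iteration = 10.0 / 70

# At a 30-second duration we'd expect roughly three times as many samples,
# which shrinks the standard error of the mean by about sqrt(3).
iterations_at_30s = (30 / seconds_per_iteration).round

puts iterations_at_30s # 210
```

More samples per run means the per-run mean is less sensitive to a handful of outlier iterations, which is exactly the kind of noise that shows up as spurious regressions.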
Additional Notes:
There is a downside to this -- because our benchmarks are currently executed sequentially in CI, this will make the benchmark run take quite a bit longer than it used to.
Hopefully this trade-off is reasonable; if not, we can re-evaluate.
How to test the change?
Check the latest benchmarks CI run, and confirm the tests are running for 30 seconds, rather than 10/12.