
[NO-TICKET] Raise benchmark default durations #3869

Closed
ivoanjo wants to merge 4 commits into master from ivoanjo/raise-benchmark-durations

Conversation


@ivoanjo ivoanjo commented Aug 27, 2024

What does this PR do?

This PR goes through our existing benchmarks, and for those that used a default duration of 10 or 12 seconds, raises the duration to 30s.
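
For illustration, here's a minimal sketch of the kind of change, assuming a typical benchmark-ips setup (the benchmark name and exact config are illustrative, not lifted from the actual files):

```ruby
require 'benchmark/ips'

Benchmark.ips do |x|
  # Previously time: 10 (or 12); raised so that benchmarks with slow
  # iterations accumulate enough samples for a stable measurement.
  x.config(time: 30, warmup: 2)

  x.report('profiler - sample serialize') do
    # ... work being benchmarked ...
  end

  x.compare!
end
```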

Motivation:

We've observed most of these benchmarks having flaky regressions/improvements on PRs that shouldn't affect them.

I suspect that running the benchmarks for such a short time may be contributing to the flakiness. For instance, one of the benchmarks that flakes most often is profiler_sample_serialize.rb, and coincidentally that's a benchmark where each iteration does a lot of work, so in a typical 10 second run we may only see around 70 iterations (roughly 140ms per iteration) -- too few samples for a stable result.

Hopefully by running the benchmarks for slightly longer we'll have more consistent results.

Additional Notes:

There is a downside to this -- because our benchmarks are currently executed sequentially in CI, this will make the benchmark run take quite a bit longer than it used to.

Hopefully this trade-off is reasonable; if not, we can re-evaluate.

How to test the change?

Check the latest benchmarks CI run, and confirm the tests are running for 30 seconds, rather than 10/12.

@ivoanjo ivoanjo requested a review from a team as a code owner August 27, 2024 16:53
@ivoanjo ivoanjo added the dev/ci (Involves CircleCI, GitHub Actions, or GitLab) and dev/internal (Other internal work that does not need to be included in the changelog) labels Aug 27, 2024
@github-actions github-actions bot added the dev/testing (Involves testing processes, e.g. RSpec) label and removed the dev/ci label Aug 27, 2024

codecov-commenter commented Aug 27, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.85%. Comparing base (3679939) to head (ee157a1).
Report is 10 commits behind head on master.

Additional details and impacted files
```
@@            Coverage Diff             @@
##           master    #3869      +/-   ##
==========================================
- Coverage   97.85%   97.85%   -0.01%
==========================================
  Files        1271     1277       +6
  Lines       76024    76304     +280
  Branches     3740     3740
==========================================
+ Hits        74395    74667     +272
- Misses       1629     1637       +8
```



pr-commenter bot commented Aug 27, 2024

Benchmarks

Benchmark execution time: 2024-08-28 17:05:59

Comparing candidate commit ee157a1 in PR branch ivoanjo/raise-benchmark-durations with baseline commit 0c87de3 in branch master.

Found 2 performance improvements and 2 performance regressions! Performance is the same for 21 metrics, 0 unstable metrics.

scenario:profiler - Major GC runs (profiling disabled)

  • 🟩 throughput [+2.789op/s; +2.906op/s] or [+2.395%; +2.495%]

scenario:profiler - Major GC runs (profiling enabled)

  • 🟩 throughput [+2.636op/s; +2.749op/s] or [+2.389%; +2.491%]

scenario:profiler - sample timeline=false

  • 🟥 throughput [-0.452op/s; -0.439op/s] or [-6.267%; -6.084%]

scenario:tracing - Propagation - Datadog

  • 🟥 throughput [-737.799op/s; -696.846op/s] or [-2.217%; -2.094%]

Seems to still be unstable in a benchmarking run:

```
ruby 3.3.0 (2023-12-25 revision 5124f9ac75) [x86_64-linux]
Warming up --------------------------------------
profiler - hold / resume
                       203.634k i/100ms
Calculating -------------------------------------
profiler - hold / resume
                          2.037M (± 0.2%) i/s -     61.294M in  30.096368s
```

vs

```
ruby 3.3.0 (2023-12-25 revision 5124f9ac75) [x86_64-linux]
Warming up --------------------------------------
profiler - hold / resume
                       210.917k i/100ms
Calculating -------------------------------------
profiler - hold / resume
                          2.099M (± 0.2%) i/s -     63.064M in  30.040197s
```
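
For reference, the two runs above differ by roughly 3% in throughput (2.099M / 2.037M ≈ 1.03), which is well outside the ±0.2% deviation each individual run reports for itself -- that between-run gap is the instability in question.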

Let's see if, as a last-ditch attempt, raising the duration a bit
more fixes it, or if we need to look into other options.
@ivoanjo ivoanjo force-pushed the ivoanjo/raise-benchmark-durations branch from ec4fe81 to 01b37f8 on August 28, 2024 08:16
Still got an unstable result on the last run; let's see if 60 seconds
helps here or not...

@p-datadog p-datadog left a comment


I support the changes if they make the benchmarks less flaky, but the runtime is now 45 minutes for this configuration? Should we consider splitting it?


ivoanjo commented Aug 28, 2024

I support the changes if they make the benchmarks less flaky, but the runtime is now 45 minutes for this configuration? Should we consider splitting it?

Good point. To be honest, I'm still seeing too much variation in the benchmarks, so as a last desperate measure I'll try raising every one to 1 minute.

If one minute is reasonable (I'm starting to suspect it's not going to be), I think it's worth looking into splitting the run...
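
For illustration, one way the split could look -- a hypothetical helper that partitions the benchmark files across parallel CI jobs (CI_NODE_INDEX, CI_NODE_TOTAL, and the glob are made-up placeholders, not our actual CI setup):

```ruby
require 'rbconfig'

# Hypothetical sketch: partition benchmark files across parallel CI jobs.
files = Dir.glob('benchmarks/*.rb').sort
node_index = Integer(ENV.fetch('CI_NODE_INDEX', '0'))
node_total = Integer(ENV.fetch('CI_NODE_TOTAL', '1'))

# Split the sorted file list into node_total roughly-equal groups and run
# only the group assigned to this node.
group_size = [(files.size.to_f / node_total).ceil, 1].max
my_files = files.each_slice(group_size).to_a.fetch(node_index, [])

my_files.each do |file|
  system(RbConfig.ruby, file) or abort("benchmark #{file} failed")
end
```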

I suspect this won't be enough to make the benchmarks consistent, but
let's try and see how it goes.

ivoanjo commented Aug 29, 2024

Ok, it seems like even with a 60-second duration, we're still seeing bogus improvements/regressions being reported. Let me move this PR to draft for now while I follow up on other solutions.

@ivoanjo ivoanjo marked this pull request as draft August 29, 2024 08:15

ivoanjo commented Sep 19, 2024

We ended up tweaking the thresholds on the benchmarking platform and it reduced the variance a lot. There are still a few tests that seem to show wider variance -- I'll open separate PRs to tweak their durations as needed.

Closing this one for now.

@ivoanjo ivoanjo closed this Sep 19, 2024