Increase the number of times CI runs benchmarks #49092
base: master
Conversation
Force-pushed from 622b657 to 3ffcce1
@r0mant @doggydogworld what are your thoughts on this change? The intent is to provide more reliable information by running the benchmark tests a few times, though that comes at the expense of the tests taking longer to complete. The benchmark tests are still faster than the unit and integration tests, however.
Yeah, I originally chose 1x to ensure that it finishes ASAP, to not be a blocker, and to prove that changes won't break the benchmark. Running a minimum of 6 times would be necessary to test for performance regressions, so this was eventually going to make it in. Since we've been able to observe so far that 6 runs probably won't slow CI down too much, I think it's good to do it now. The eventual goal would then be to create a baseline (which would have more than 6 runs) to compare with these results to determine regressions.
Makefile
Outdated
@@ -943,14 +943,14 @@ endif
 test-go-bench: PACKAGES = $(shell grep --exclude-dir api --include "*_test.go" -lr testing.B . | xargs dirname | xargs go list | sort -u)
 test-go-bench: BENCHMARK_SKIP_PATTERN = "^BenchmarkRoot"
 test-go-bench: | $(TEST_LOG_DIR)
-	go test -run ^$$ -bench . -skip $(BENCHMARK_SKIP_PATTERN) -benchtime 1x $(PACKAGES) \
+	go test -run ^$$ -bench . -skip $(BENCHMARK_SKIP_PATTERN) -count=6 -benchtime 1x $(PACKAGES) \
Do we take advantage of the increased runs in any way? I can see this being part of a larger change that makes use of the bench results, but on its own I'm not sure it adds much. (Even so I'm a bit skeptical that every benchmark we have was built with this in mind.)
Also notice that we are still running 1x, so even though the count is higher, each individual benchmark run is still minimal.
We'd like to use our benchmarks as a signal that nothing has broken or regressed between releases. However, right now output between two consecutive tags on a release branch can vary quite dramatically. I'm hoping this might have some impact on stabilizing things.
Should we increase the benchtime too?
I'm not against it, I was trying to limit the changes here to keep test times from ballooning. Do you have a recommendation?
I think it's hard to pitch a generic number; my 2c is that the flags should be set on a per-benchmark basis. It also depends on how long we want the jobs to take.
Yeah, it seems we have so many benchmarks that adding 6s for each one really balloons it. I think this is a good first step, at least, to get some data.
It sounds like we'll want to revert to -benchtime=1x, but I'll leave my stamp regardless. Thanks for all the explanations here, folks.
Oh, one last suggestion: what about -benchtime=6x instead of -benchtime=1x -count=6?
I'll give that a try. I don't really want to merge this as is and require all PRs to wait >30 minutes just so the benchmark tests can finish.
With -benchtime=1x -count=6 we are back to meeting benchstat's "need >= 6 samples for confidence interval at level 0.95" requirement.
This is in an attempt to provide more reliable test results. 6 was chosen as that's the minimum number benchstat requires for a 95% confidence interval.
Force-pushed from 1dcfe4f to ede6752
Example of warning emitted from benchstat in CI: https://github.com/gravitational/teleport/actions/runs/11862270592