Separate expensive tests in elastic-agent CI into a separate pipeline #4710
Comments
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane) |
Pinging @elastic/elastic-agent (Team:Elastic-Agent) |
@pchila @blakerouse is there already some easy way to invoke these tests separately, e.g. a mage target? I imagine once we have an easy way to invoke these tests separately, it's a matter of editing the Buildkite pipeline definition to run both test jobs. |
We already have mage targets that run the tests... they expect the necessary packages to be present, though, so we can (either/or):
I believe that option 1 is easier and quicker... and we can always evolve it toward option 2 later if we see value in it |
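For illustration only, a minimal sketch of what wiring those targets into a dedicated Buildkite step could look like; the target names (mage package, mage integration:test) and the step layout are assumptions taken from this discussion and the dev guide, not the actual pipeline definition.

```yaml
# Hypothetical Buildkite step (not the repository's real pipeline file):
# build the packages the integration tests expect, then invoke the
# existing mage target that runs them.
steps:
  - label: "integration tests"
    command:
      - "mage package"           # assumed packaging target
      - "mage integration:test"  # assumed integration test target
```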
I talked to @pazone and he acknowledged that it's possible and asked for this ticket to be assigned to him. |
The integration testing framework already supports executing specific groups: https://github.com/elastic/elastic-agent/blob/main/docs/test-framework-dev-guide.md#selecting-specific-group. That could be used to execute only a set of groups for different types. It might be too granular, though, as it would require selecting groups, and any time a new group is added the job would need adjusting. Adding something like |
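As a sketch only: if group selection is driven by an environment variable as the linked guide describes (the variable name TEST_GROUPS and the group names below are placeholders, not verified against the framework), a Buildkite build matrix could fan the groups out into separate jobs. The same caveat applies, though: the matrix list still has to be edited whenever a new group is added.

```yaml
steps:
  - label: "integration tests ({{matrix}})"
    # {{matrix}} is Buildkite's build-matrix interpolation; the variable
    # name and group names are placeholders for whatever the framework uses.
    command: "TEST_GROUPS={{matrix}} mage integration:test"
    matrix:
      - "default"
      - "upgrade"
      - "fleet"
```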
We can also avoid running ITs on each PR commit. For example, we could set up a special PR comment that triggers the separate extended testing pipeline. |
We should introduce the monorepo plugin so we stop pointlessly triggering integration tests. That can be done separately from this issue. #3674 |
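For context, this is roughly the shape of configuration such a plugin takes, assuming the commonly used chronotc/monorepo-diff Buildkite plugin is what is meant here; the version, watched paths, and uploaded pipeline file are illustrative only.

```yaml
steps:
  - label: "Run integration tests only when relevant paths change"
    plugins:
      # Plugin version, paths, and pipeline file are placeholders,
      # not taken from the elastic-agent repository.
      - chronotc/monorepo-diff#v2.x.y:
          watch:
            - path:
                - "testing/integration/"
                - "internal/"
              config:
                command: "buildkite-agent pipeline upload .buildkite/integration.pipeline.yml"
```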
I have wanted two things for a while as ergonomic improvements that I think would fit under this:
Finally, for the concurrency issue, are there any built-in Buildkite features we can use to help with this quickly? https://buildkite.com/docs/pipelines/controlling-concurrency for example? |
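For reference, the linked feature is expressed per step with the concurrency and concurrency_group keys; a minimal sketch with a placeholder group name and limit:

```yaml
steps:
  - label: "integration tests"
    command: "mage integration:test"
    # At most 4 jobs across all builds run in this named group at once;
    # the group name and the limit here are placeholders.
    concurrency: 4
    concurrency_group: "elastic-agent/integration-tests"
```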
As long as this still works from a local developer system. I do not want to have to think about uploading to GCP before running the tests or anything like that.
|
@pazone for now the important part, in order to stabilize the CI, is to extract the following steps into a separate pipeline:
This separate pipeline should have a limited concurrency level. Let's start with a maximum of 4 concurrent builds, so we cannot run more than 4 concurrent steps of the types listed above. Once we have this, we can iterate and extend the functionality/granularity as suggested in the comments here, unless @cmacknz has any objection. |
Why 4? I don't want us all to be waiting around for more hours than necessary. I know we have to pick a number, but what problem are we specifically trying to solve with this? Reducing the number of VMs below some quota? What quota? Where is it set? Another option here is to reduce the number of unique test groups in the integration tests, which create individual VM instances, or to refactor the three separate targets into one target and let them all run sequentially. |
@cmacknz we need to start with some number and iterate. We can start with 16, 10, any number, and then see how it affects the CI. 4 is the lowest possible number I think we can go with, and from there we see how stable the CI has become. If the CI is fully stable, we increase the limit the next week until we start to see instabilities. Changing this number is relatively easy; the important part is to have some limit in place that we can adjust. |
Is there a way we can avoid guessing? Can we observe how many active VMs we have at a time in GCP and go off of that? If we have to guess, that works, but it would be faster to know what scale we are already operating at. How many VMs does a single integration test run require right now? For example, counting manually it looks like we have 9 unique test groups, which means there should be something like at least 9 VMs of each machine type that need to be scheduled simultaneously. I see:
|
@cmacknz even if we calculated the precise number of VMs needed per build, we still would not know the failure/instability point of the GCP API. I don't have historical data for the number of VMs we had running when GCP started failing for us. If it's ~9 VMs per running build, 4 concurrent builds would already result in 36 VMs, plus we can guarantee their shutdown only after 4 hours, which makes the number of VMs running at the same time somewhat unpredictable. This makes me think we have to guess. The number can be anything; we can start with 8, it does not matter. But we must start with something and then iterate. It's like |
My main concern is that without understanding in detail what the limit is, or exactly what problem we are working around, we may permanently make CI slower without an explanation or a path to fixing the underlying problem. I do not want us to arbitrarily limit the number of CI jobs and permanently cap developer productivity without an actual reason.
Are we requesting too many machines at once? Is the type of machine we are requesting not as available as others we could use instead? Are we using the API wrong, and are we supposed to overflow into another AZ in the same region when we get capacity-related errors? Do we need a capacity reservation with GCP itself to guarantee availability (would this also be cheaper?)?
If you don't want to try and dig into the root cause, at least do an analysis of whether the reduced flakiness from capping concurrent builds actually saves us time over running more builds in parallel with the risk that some of them fail. If we can make CI 100% reliable by executing a single build at a time, we have solved the flakiness problem at the expense of greatly limiting our throughput as a team. At the other end of this spectrum, I will happily trade off one flaky build per day for unlimited concurrency. |
For comparison, the Beats CI buildkite jobs look to have at least as many concurrent jobs on Buildkite agents if not more. How are they avoiding capacity issues there? That repository is also much more active than agent. |
@cmacknz but I didn't suggest keeping this low number forever; we start with some number and then gradually increase it until we hit instability issues again. It's going to be a bit slower in the beginning, yes.
Well, that's our current situation, so why would we even consider changing anything then? We might as well keep it as it is.
I don't think we can compare here: every agent build creates a large number of VMs. To my knowledge, Beats does not do that. My understanding is that creating a massive number of VMs concurrently through the GCP API is the source of our problems. We can have a lot of builds (main, 8.14, 8.13, daily, PRs) running at the same time, and every build has 3 different steps that create VMs. I suppose we just overload the GCP scheduler in this zone, but I don't have any data to support this claim. |
Flaky tests are a net negative to developer productivity and PR throughput. We want to drive them to zero as a way of optimizing developer productivity. I view the primary goal of eliminating flaky tests to be increasing developer productivity.
This means that solutions that eliminate flaky tests at the cost of making developer productivity and throughput worse than they already are, are a net negative to this goal (with room for nuance; eliminating flakiness by being 1 minute slower is obviously fine). This is why I don't like the idea of just limiting build parallelism without a strategy for eliminating the underlying problem.
If we learn something from briefly reducing parallelism, like confirming it fixes the problem, or finding the exact point where it fails, then I'm OK with that as a temporary measure to help move the solution forward. I just don't want us to permanently reduce parallelism and flakiness at the expense of developer productivity, because we are really using flakiness as a proxy measurement of wasted developer time. Overall developer productivity is the result we actually care about. |
Now it's separated. Each test type has a dedicated concurrency group limited to 8. The values can be adjusted in
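That layout corresponds roughly to one concurrency group per test type; a sketch with the limit of 8 mentioned above, where the group names and commands are placeholders rather than the repository's actual step definitions:

```yaml
steps:
  - label: "integration tests"
    command: "mage integration:test"                 # placeholder command
    concurrency: 8
    concurrency_group: "elastic-agent/integration"   # placeholder group name

  - label: "serverless integration tests"
    command: "mage integration:test"                 # placeholder command
    concurrency: 8
    concurrency_group: "elastic-agent/serverless"    # placeholder group name

  - label: "extended leak tests"
    command: "mage integration:test"                 # placeholder command
    concurrency: 8
    concurrency_group: "elastic-agent/leak-tests"    # placeholder group name
```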
In order to have a bit more control over how many concurrent serverless/extended leak/integration tests we are running in parallel at any given time, we may think of creating a separate pipeline that runs only those tests.
For example, we can create a pipeline named elastic-agent-extended-testing that creates a build for a PR or a release branch commit, or is maybe even invoked by the current elastic-agent pipeline (I am not very knowledgeable about all the Buildkite features and settings). It may start from a commit (or even from already created packages) and run our longer tests. We may set up this pipeline to only run a certain number of builds in parallel, giving us one more knob we can turn to reduce the impact we have on cloud environments (especially wrt exhausting ARM VMs or other scarce resources in a zone/region).
Edit: if we implement a separate pipeline for extended testing, we will need to adjust the PR checks (we will need an additional required check from the new build) and also our build indicator for main and release branches (the small green tick or red x next to our commits in GitHub).
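For illustration, a minimal sketch of how the current elastic-agent pipeline could hand off to such a pipeline with a Buildkite trigger step; the pipeline slug comes from the proposal above, everything else is a placeholder:

```yaml
steps:
  - label: "Trigger extended testing"
    trigger: "elastic-agent-extended-testing"
    build:
      # Pass the triggering commit and branch through so the extended
      # pipeline tests exactly what this build was started from.
      commit: "${BUILDKITE_COMMIT}"
      branch: "${BUILDKITE_BRANCH}"
```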