refactor(ci): Upload stage telemetry separately for each stage #22616
Conversation
@@ -10,60 +10,74 @@ interface ParsedJob {
    stageName: string;
    startTime: number;
    finishTime: number;
-   totalTime: number;
+   totalSeconds: number;
Just for clarity within this file. The telemetry event generated at the end still uses duration.
scripts/get-test-pass-rate.mjs
Outdated
This one was simplified because it no longer has to first retrieve the list of stages; it is now passed a specific stage id from the environment.
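As a rough sketch of that contract (the step layout and the variable used as the id's source are assumptions, not verbatim from this PR), the pipeline can hand the script a single stage id through the environment:

```yaml
# Hypothetical step: invoke the pass-rate script for one stage.
- script: node scripts/get-test-pass-rate.mjs
  displayName: Compute test pass rate for this stage
  env:
    # STAGE_ID tells the script which single stage to query,
    # so it no longer has to retrieve the full list of stages.
    STAGE_ID: $(System.StageName) # assumed source for the id
```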
stages:
- template: templates/include-conditionally-run-stress-tests.yml
  parameters:
    artifactBuildId: $(resources.pipeline.client.runID)
    packages: ${{ parameters.packages }}
    testWorkspace: ${{ variables.testWorkspace }}

-# Capture telemetry about pipeline stages
All this is now covered by include-test-real-service.yml, which include-conditionally-run-stress-tests.yml already uses (line 68). The same chunk is gone from the bottom of some other pipeline files for the same reason.
The Performance Benchmarks pipeline doesn't use the include-test-real-service.yml template that now includes a stage for telemetry upload, so it directly includes include-upload-stage-telemetry.yml itself for each stage where it needs it.
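Schematically, that direct inclusion might look like this (the stage name and the template parameter name are illustrative assumptions):

```yaml
stages:
  - stage: run_benchmarks
    jobs:
      - job: benchmarks
        # ... benchmark steps ...

  # Included directly, since this pipeline does not go through
  # include-test-real-service.yml, which would otherwise add it.
  - template: templates/include-upload-stage-telemetry.yml
    parameters:
      stageId: run_benchmarks # hypothetical parameter name
```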
    secureFile: ${{ parameters.r11sSelfSignedCertSecureFile }}
    retryCount: '2'

stages:
This template used to define just a list of jobs, so pipelines that included it had to define the stage themselves. Now the template defines the stage itself, so it can also define the companion stage that uploads the telemetry.
- publish_npm_internal_dev
- ${{ if or(eq(variables['release'], 'release'), eq(variables['release'], 'prerelease')) }}:
  - publish_npm_public
# Note: the publish stages are created in include-publish-npm-package.yml. We need to match the ids exactly.
I realized most of these stages always exist, so we can declare the dependencies without conditions.
jobs:
- job: upload_run_telemetry
  displayName: Upload pipeline run telemetry to Kusto
  pool: Small-1ES
  variables:
  - group: ado-feeds
  - name: pipelineTelemetryWorkdir
-   value: $(Pipeline.Workspace)/pipelineTelemetryWorkdir
+   value: $(Pipeline.Workspace)/pipelineTelemetryWorkdir/timingOutput
We always appended timingOutput, so we might as well include it in the variable itself.
inputs:
  targetType: inline
  workingDirectory: $(absolutePathToTelemetryGenerator)
Now that the task is split into two, "get data" and "send data to Kusto", the "get data" piece doesn't need to run in this directory.
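A sketch of the split (the task layout and inline commands are assumptions; only the workingDirectory detail comes from the diff):

```yaml
steps:
  # "Get data": can run from anywhere, so no workingDirectory is set.
  - task: Bash@3
    displayName: Get stage timing data
    inputs:
      targetType: inline
      script: |
        # gather run/stage data here (actual command elided)

  # "Send data to Kusto": this half still runs from the
  # telemetry generator's directory.
  - task: Bash@3
    displayName: Send telemetry to Kusto
    inputs:
      targetType: inline
      workingDirectory: $(absolutePathToTelemetryGenerator)
      script: |
        # send the collected data to Kusto (actual command elided)
```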
Looks pretty good to me.
If we end up breaking up some pipelines per service, this makes that easier.
Thanks for the review @jason-ha! I also just confirmed that the expected telemetry from all my test runs shows correctly in Kusto.
Description
This refactors our setup to upload telemetry for stages during ADO pipeline runs.
One of the main problems with the current setup is that it waits for all "target" stages (the ones whose telemetry we want to upload) in a given pipeline to complete before the stage that uploads telemetry runs. Consider the E2E tests pipeline: the stage that runs tests against ODSP usually takes ~2hrs, but it can also wait many hours to start, because we hold exclusive locks so that only one pipeline run at a time can execute tests against a given external service, and the stage has to wait for the corresponding stage in previous pipeline runs to finish. As a result, the telemetry for other stages can severely lag the actual time when things happened. This can cause confusion when our OCEs get IcM incidents, because the thing that caused the incident to fire happened many hours earlier (sometimes the previous day).
The refactor in this PR replaces the single stage at the end of a pipeline run, which uploaded the telemetry for all other relevant stages in that run, with one upload stage per relevant "target" stage. Each new stage depends only on its "target" stage, so it runs immediately after it. The disadvantage is that a pipeline run now has many more stages; each one needs to be scheduled on an available build agent, and they all run similar steps, like checking out the repository, so total build-agent usage will probably go up a bit. The monetary cost should not be significant, though, so I think this is fine. All of this applies to the test pipelines; I kept the existing setup for Build - client because there we don't really care about tracking each stage separately, only the pipeline as a whole.
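The shape of the new arrangement, as a sketch (the stage names and the condition are illustrative, not the exact ones in the templates):

```yaml
stages:
  # A "target" stage whose telemetry we want to capture.
  - stage: e2e_tests_odsp
    jobs:
      - job: run_tests
        # ... test steps ...

  # The upload stage depends only on its target stage, so it runs
  # immediately after that stage finishes instead of waiting for
  # the whole pipeline run to complete.
  - stage: upload_telemetry_e2e_tests_odsp
    dependsOn:
      - e2e_tests_odsp
    condition: succeededOrFailed() # assumed: upload even on failure
    jobs:
      - job: upload_stage_telemetry
        # ... checkout, gather stage data, send to Kusto ...
```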
The refactor also entailed some cleanup and improvements in JS/TS code related to stage telemetry. The scripts that get the run information for a stage can now take a specific STAGE_ID from the environment instead of retrieving the list of stage ids themselves. They're also more aggressive about validating the inputs they expect.

Reviewer Guidance
The review process is outlined on this wiki page.
Runs of the test pipelines with these changes (msft internal):
(… true in a parameter so it would run the stage that generates telemetry, even if it was for a test/ branch; then undid that change for the PR).

AB#11566