
refactor(ci): Upload stage telemetry separately for each stage #22616

Merged (42 commits) on Sep 26, 2024

Conversation

@alexvy86 (Contributor) commented Sep 24, 2024

Description

This refactors our setup to upload telemetry for stages during ADO pipeline runs.

One of the main problems with the current setup is that it waits for all "target" stages (the ones whose telemetry we want to upload) in a given pipeline to complete before the stage that uploads telemetry runs. The E2E tests pipeline is the worst case: the stage that runs tests against ODSP usually takes ~2hrs, but it can also wait many hours to start, because we take exclusive locks so that only one pipeline run at a time can execute tests against a given external service, and the stage therefore has to wait for the corresponding stage in previous pipeline runs to finish. As a result, the telemetry for the other stages can severely lag the actual time when things happened. This causes confusion when our OCEs get IcM incidents, because the thing that caused the incident to fire happened many hours earlier (sometimes the previous day).

The refactor in this PR replaces the single stage at the end of a pipeline run, which uploaded the telemetry for all other relevant stages in that run, with one upload stage per relevant "target" stage. Each new stage depends only on its "target" stage, so it runs immediately after it. The disadvantage is that a pipeline run now has many more stages; each one needs to be scheduled on an available build agent, and they all run similar steps (like checking out the repository), so total build-agent usage will probably go up a bit. The monetary cost should not be significant, though, so I think this is fine. All of this applies to the test pipelines; I kept the existing setup for Build - client because there we don't really care about tracking each stage separately, only the pipeline as a whole.
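As a rough sketch, the per-stage arrangement looks something like this (stage names, and the use of a stage-level condition, are illustrative rather than taken from the actual templates):

```yaml
stages:
  # A "target" stage whose telemetry we want to upload.
  - stage: run_odsp_tests
    jobs:
      # ... test jobs ...

  # The telemetry-upload stage depends only on its target stage,
  # so it is scheduled as soon as that stage finishes instead of
  # waiting for the whole pipeline.
  - stage: upload_run_odsp_tests_telemetry
    dependsOn:
      - run_odsp_tests
    # Assumption: we'd want telemetry even when the target stage fails.
    condition: succeededOrFailed()
    jobs:
      # ... telemetry upload job ...
```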

The refactor also entailed some cleanup and improvements in the JS/TS code related to stage telemetry. The scripts that get the run information for a stage can now take a specific STAGE_ID from the environment instead of retrieving the list of stage ids themselves. They are also more aggressive about validating the inputs they expect.
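A minimal sketch of the stricter input handling described above, not the actual script: the stage id comes from the environment and is validated before use. The STAGE_ID variable name matches the PR description; the function name is illustrative.

```typescript
// Hypothetical helper: read and validate the stage id the telemetry
// scripts expect. In the real script this would be called with process.env.
function getStageIdFromEnv(env: Record<string, string | undefined>): string {
	const stageId = env.STAGE_ID?.trim();
	if (stageId === undefined || stageId === "") {
		// Fail fast instead of silently producing telemetry for no stage.
		throw new Error("STAGE_ID environment variable must be set and non-empty");
	}
	return stageId;
}
```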

Reviewer Guidance

The review process is outlined on this wiki page.

Runs of the test pipelines with these changes (msft internal):

AB#11566

@github-actions github-actions bot added area: build Build related issues base: main PRs targeted against main branch labels Sep 24, 2024
@@ -10,60 +10,74 @@ interface ParsedJob {
stageName: string;
startTime: number;
finishTime: number;
totalTime: number;
totalSeconds: number;
@alexvy86 (Contributor Author) commented:
Just for clarity within this file. The telemetry event generated at the end still uses duration.

@alexvy86 (Contributor Author) commented:

This one was simplified because it doesn't have to retrieve the list of stages first; it is now passed a specific stage id through the environment.


stages:
- template: templates/include-conditionally-run-stress-tests.yml
parameters:
artifactBuildId: $(resources.pipeline.client.runID)
packages: ${{ parameters.packages }}
testWorkspace: ${{ variables.testWorkspace }}

# Capture telemetry about pipeline stages
@alexvy86 (Contributor Author) commented Sep 24, 2024:

All of this is now covered by include-test-real-service.yml, which the include-conditionally-run-stress-tests.yml template above already uses.

The same chunk is gone from the bottom of some other pipeline files for the same reason.

@alexvy86 (Contributor Author) commented:

The Performance Benchmarks pipeline doesn't use the include-test-real-service.yml template that now includes a stage for telemetry upload, so it directly includes include-upload-stage-telemetry.yml itself for each stage where it needs it.

secureFile: ${{ parameters.r11sSelfSignedCertSecureFile }}
retryCount: '2'

stages:
@alexvy86 (Contributor Author) commented:

This template used to define just a list of jobs, so pipelines that included it had to define the stage themselves. Now the template defines the stage, which means it can also define the stage that uploads the telemetry.
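A minimal sketch of the new template shape (the stage name and the stageId parameter name are guesses for illustration; include-upload-stage-telemetry.yml is the telemetry template mentioned elsewhere in this PR):

```yaml
# Before: this template defined only jobs, and each including pipeline
# wrapped them in a stage itself.
# After: the template defines the stage, so it can pull in the
# telemetry-upload stage alongside it.
stages:
  - stage: run_tests
    jobs:
      # ... test jobs ...
  - template: include-upload-stage-telemetry.yml
    parameters:
      stageId: run_tests
```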

- publish_npm_internal_dev
- ${{ if or(eq(variables['release'], 'release'), eq(variables['release'], 'prerelease')) }}:
- publish_npm_public
# Note: the publish stages are created in include-publish-npm-package.yml. We need to match the ids exactly.
@alexvy86 (Contributor Author) commented:

I realized most of these stages always exist, so we can declare the dependencies without conditions.

jobs:
- job: upload_run_telemetry
displayName: Upload pipeline run telemetry to Kusto
pool: Small-1ES
variables:
- group: ado-feeds
- name: pipelineTelemetryWorkdir
value: $(Pipeline.Workspace)/pipelineTelemetryWorkdir
value: $(Pipeline.Workspace)/pipelineTelemetryWorkdir/timingOutput
@alexvy86 (Contributor Author) commented:

We always appended timingOutput, so we might as well include it in the variable itself.

inputs:
targetType: inline
workingDirectory: $(absolutePathToTelemetryGenerator)
@alexvy86 (Contributor Author) commented Sep 25, 2024:

Now that the task is split in two ("get data" and "send data to Kusto"), the "get data" piece doesn't need to run in this directory.

@alexvy86 alexvy86 requested review from a team September 25, 2024 17:02
@jason-ha (Contributor) commented:

Would be nice if ADO let you just specify ${{ parameters }} instead of breaking out all the ones needed.


Refers to: tools/pipelines/templates/include-test-real-service.yml:443 in b2f48de. [](commit_id = b2f48de, deletion_comment = False)

@jason-ha (Contributor) left a review:

Looks pretty good to me.
If we end up breaking up some pipelines per service, this makes that easier.

@alexvy86 (Contributor Author) commented:

Thanks for the review, @jason-ha! I also just confirmed that the expected telemetry from all my test runs shows up correctly in Kusto.

@alexvy86 alexvy86 merged commit c126bba into microsoft:main Sep 26, 2024
52 checks passed
@alexvy86 alexvy86 deleted the per-stage-telemetry-upload branch September 26, 2024 15:38