
refactor(ci): Upload stage telemetry separately for each stage #22616

Merged (42 commits) on Sep 26, 2024

Conversation

@alexvy86 (Contributor) commented Sep 24, 2024

Description

This refactors our setup to upload telemetry for stages during ADO pipeline runs.

One of the main problems with the current setup is that it waits for all "target" stages (the ones whose telemetry we want to upload) in a given pipeline to complete before the stage that uploads telemetry runs. The E2E tests pipeline is the worst case: the stage that runs tests against ODSP usually takes ~2hrs, but it can also wait many hours to start, because we take exclusive locks so that only one pipeline run at a time can execute tests against a given external service, and the stage therefore has to wait for the corresponding stage in previous pipeline runs to finish. As a result, the telemetry for the other stages can severely lag the actual time when things happened. This causes confusion when our OCEs get IcM incidents, because the thing that caused the incident to fire happened many hours earlier (sometimes the previous day).

The refactor in this PR replaces the single stage at the end of a pipeline run, which uploaded the telemetry for all other relevant stages in that run, with one upload stage per relevant "target" stage. Each new stage depends only on its "target" stage, so it runs immediately after it. The disadvantage is that a pipeline run now has many more stages; each one needs to be scheduled on an available build agent, and they all run similar steps (like checking out the repository), so total build-agent usage will probably go up a bit. The monetary cost should not be significant, though, so I think this is fine. All of this applies to the test pipelines; I kept the existing setup for Build - client because there we don't really care about tracking each stage separately, only the pipeline as a whole.
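As a rough sketch, the per-stage arrangement looks something like this (stage names, and the use of a stage-level condition, are illustrative rather than taken from the actual templates):

```yaml
stages:
  # A "target" stage whose telemetry we want to upload.
  - stage: run_odsp_tests
    jobs:
      # ... test jobs ...

  # The telemetry-upload stage depends only on its target stage,
  # so it is scheduled as soon as that stage finishes instead of
  # waiting for the whole pipeline.
  - stage: upload_run_odsp_tests_telemetry
    dependsOn:
      - run_odsp_tests
    # Assumption: we'd want telemetry even when the target stage fails.
    condition: succeededOrFailed()
    jobs:
      # ... telemetry upload job ...
```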

The refactor also entailed some cleanup and improvements in the JS/TS code related to stage telemetry. The scripts that get the run information for a stage can now take a specific STAGE_ID from the environment instead of retrieving the list of stage ids themselves. They are also more aggressive about validating the inputs they expect.
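A minimal sketch of the stricter input handling described above, not the actual script: the stage id comes from the environment and is validated before use. The STAGE_ID variable name matches the PR description; the function name is illustrative.

```typescript
// Hypothetical helper: read and validate the stage id the telemetry
// scripts expect. In the real script this would be called with process.env.
function getStageIdFromEnv(env: Record<string, string | undefined>): string {
	const stageId = env.STAGE_ID?.trim();
	if (stageId === undefined || stageId === "") {
		// Fail fast instead of silently producing telemetry for no stage.
		throw new Error("STAGE_ID environment variable must be set and non-empty");
	}
	return stageId;
}
```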

Reviewer Guidance

The review process is outlined on this wiki page.

Runs of the test pipelines with these changes (msft internal):

AB#11566

@github-actions github-actions bot added area: build Build related issues base: main PRs targeted against main branch labels Sep 24, 2024
@@ -10,60 +10,74 @@ interface ParsedJob {
stageName: string;
startTime: number;
finishTime: number;
totalTime: number;
totalSeconds: number;
@alexvy86 (Contributor Author) commented:
Just for clarity within this file. The telemetry event generated at the end still uses duration.

@alexvy86 (Contributor Author) commented:

This one was simplified because it doesn't have to retrieve the list of stages first; it is now passed a specific stage id through the environment.


stages:
- template: templates/include-conditionally-run-stress-tests.yml
parameters:
artifactBuildId: $(resources.pipeline.client.runID)
packages: ${{ parameters.packages }}
testWorkspace: ${{ variables.testWorkspace }}

# Capture telemetry about pipeline stages
@alexvy86 (Contributor Author) commented Sep 24, 2024:

All of this is now covered by include-test-real-service.yml, which the include-conditionally-run-stress-tests.yml template above already uses.

The same chunk is gone from the bottom of some other pipeline files for the same reason.

@alexvy86 (Contributor Author) commented:

The Performance Benchmarks pipeline doesn't use the include-test-real-service.yml template that now includes a stage for telemetry upload, so it directly includes include-upload-stage-telemetry.yml itself for each stage where it needs it.

secureFile: ${{ parameters.r11sSelfSignedCertSecureFile }}
retryCount: '2'

stages:
@alexvy86 (Contributor Author) commented:

This template used to define just a list of jobs, so pipelines that included it had to define the stage themselves. Now the template defines the stage, which means it can also define the stage that uploads the telemetry.
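A minimal sketch of the new template shape (the stage name and the stageId parameter name are guesses for illustration; include-upload-stage-telemetry.yml is the telemetry template mentioned elsewhere in this PR):

```yaml
# Before: this template defined only jobs, and each including pipeline
# wrapped them in a stage itself.
# After: the template defines the stage, so it can pull in the
# telemetry-upload stage alongside it.
stages:
  - stage: run_tests
    jobs:
      # ... test jobs ...
  - template: include-upload-stage-telemetry.yml
    parameters:
      stageId: run_tests
```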

- publish_npm_internal_dev
- ${{ if or(eq(variables['release'], 'release'), eq(variables['release'], 'prerelease')) }}:
- publish_npm_public
# Note: the publish stages are created in include-publish-npm-package.yml. We need to match the ids exactly.
@alexvy86 (Contributor Author) commented:

I realized most of these stages always exist, so we can declare the dependencies without conditions.

jobs:
- job: upload_run_telemetry
displayName: Upload pipeline run telemetry to Kusto
pool: Small-1ES
variables:
- group: ado-feeds
- name: pipelineTelemetryWorkdir
value: $(Pipeline.Workspace)/pipelineTelemetryWorkdir
value: $(Pipeline.Workspace)/pipelineTelemetryWorkdir/timingOutput
@alexvy86 (Contributor Author) commented:

We always appended timingOutput, so we might as well include it in the variable itself.

inputs:
targetType: inline
workingDirectory: $(absolutePathToTelemetryGenerator)
@alexvy86 (Contributor Author) commented Sep 25, 2024:

Now that the task is split in two ("get data" and "send data to Kusto"), the "get data" piece doesn't need to run in this directory.

@alexvy86 alexvy86 requested review from a team September 25, 2024 17:02
@jason-ha (Contributor) commented:

Would be nice if ADO let you just specify ${{ parameters }} instead of breaking out all the ones needed.


Refers to: tools/pipelines/templates/include-test-real-service.yml:443 in b2f48de. [](commit_id = b2f48de, deletion_comment = False)

@jason-ha (Contributor) left a review:

Looks pretty good to me.
If we end up breaking up some pipelines per service, this makes that easier.

@alexvy86 (Contributor Author) commented:

Thanks for the review, @jason-ha! I also just confirmed that the expected telemetry from all my test runs shows up correctly in Kusto.

@alexvy86 alexvy86 merged commit c126bba into microsoft:main Sep 26, 2024
52 checks passed
@alexvy86 alexvy86 deleted the per-stage-telemetry-upload branch September 26, 2024 15:38