Production - [Alerting] Android emulator failure rate alert #11605

dotnet-eng-status · 2022-11-12T03:26:20Z

💔 Metric state changed to alerting

Description and instructions for this alert

Please note that this alert will fire every 12 hours as the list of machines can change while the alert is alive. So please keep an eye on the list of machines in the comment.

FailureRate {Machine=API open / a001OKF} 90
FailureRate {Machine=API open / a001OKG} 87
FailureRate {Machine=API open / a001OKH} 93
FailureRate {Machine=API open / a001OKI} 88
FailureRate {Machine=API open / a001OKK} 93
FailureRate {Machine=API open / a001OKN} 83
FailureRate {Machine=API open / a001OKO} 83
FailureRate {Machine=API open / a001OKW} 83
FailureRate {Machine=API open / a001OL7} 85
FailureRate {Machine=API open / a001OL9} 83
FailureRate {Machine=API open / a001OLA} 83
FailureRate {Machine=API open / a001OLB} 92
FailureRate {Machine=API open / a001OLD} 100
FailureRate {Machine=API open / a001OLE} 100
FailureRate {Machine=API open / a001OLF} 100
FailureRate {Machine=API open / a001OLG} 100
FailureRate {Machine=API open / a001OLH} 100
FailureRate {Machine=API open / a001OLI} 100
FailureRate {Machine=API open / a001OLJ} 100
FailureRate {Machine=API open / a001OLK} 100
FailureRate {Machine=API open / a001OLL} 100
FailureRate {Machine=API open / a001OLM} 100
FailureRate {Machine=API open / a001OLN} 100
FailureRate {Machine=API open / a001OLO} 100
FailureRate {Machine=API open / a001OLP} 100
FailureRate {Machine=API open / a001OLQ} 100
FailureRate {Machine=API open / a001OLR} 100
FailureRate {Machine=API open / a001OLS} 100
FailureRate {Machine=API open / a001OLT} 100
FailureRate {Machine=API open / a001OLU} 100
FailureRate {Machine=API open / a001OLV} 100
FailureRate {Machine=API open / a001OLY} 100
FailureRate {Machine=API open / a001OM1} 100

Go to rule

@dotnet/dnceng, please investigate

Automation information below, do not change

Grafana-Automated-Alert-Id-e38f14fe3367451d8de43da6e2453fdd

dotnet-eng-status · 2022-11-12T15:27:46Z

💔 Metric state changed to alerting

Description and instructions for this alert

Please note that this alert will fire every 12 hours as the list of machines can change while the alert is alive. So please keep an eye on the list of machines in the comment.

FailureRate {Machine=API open / a001OKI} 92
FailureRate {Machine=API open / a001OKK} 81
FailureRate {Machine=API open / a001OL7} 88
FailureRate {Machine=API open / a001OLF} 100
FailureRate {Machine=API open / a001OLH} 100
FailureRate {Machine=API open / a001OLO} 100
FailureRate {Machine=API open / a001OLP} 90
FailureRate {Machine=API open / a001OLQ} 100
FailureRate {Machine=API open / a001OLU} 84
FailureRate {Machine=API open / a001OLV} 100
FailureRate {Machine=API open / a001OLZ} 92
FailureRate {Machine=API open / a001OM1} 100
FailureRate {Machine=API open / a001OMB} 94

Go to rule

dotnet-eng-status · 2022-11-13T03:42:18Z

💔 Metric state changed to alerting

Description and instructions for this alert

Please note that this alert will fire every 12 hours as the list of machines can change while the alert is alive. So please keep an eye on the list of machines in the comment.

FailureRate {Machine=API open / a001OKI} 92
FailureRate {Machine=API open / a001OKK} 81
FailureRate {Machine=API open / a001OL7} 88
FailureRate {Machine=API open / a001OLF} 100
FailureRate {Machine=API open / a001OLH} 100
FailureRate {Machine=API open / a001OLO} 100
FailureRate {Machine=API open / a001OLP} 90
FailureRate {Machine=API open / a001OLQ} 100
FailureRate {Machine=API open / a001OLU} 84
FailureRate {Machine=API open / a001OLV} 100
FailureRate {Machine=API open / a001OLZ} 92
FailureRate {Machine=API open / a001OM1} 100
FailureRate {Machine=API open / a001OMB} 94

Go to rule

dotnet-eng-status · 2022-11-13T15:56:49Z

💔 Metric state changed to alerting

Description and instructions for this alert

Please note that this alert will fire every 12 hours as the list of machines can change while the alert is alive. So please keep an eye on the list of machines in the comment.

FailureRate {Machine=API open / a001OKI} 92
FailureRate {Machine=API open / a001OKK} 81
FailureRate {Machine=API open / a001OL7} 88
FailureRate {Machine=API open / a001OLF} 100
FailureRate {Machine=API open / a001OLH} 100
FailureRate {Machine=API open / a001OLO} 100
FailureRate {Machine=API open / a001OLP} 90
FailureRate {Machine=API open / a001OLQ} 100
FailureRate {Machine=API open / a001OLU} 84
FailureRate {Machine=API open / a001OLV} 100
FailureRate {Machine=API open / a001OLZ} 92
FailureRate {Machine=API open / a001OM1} 100
FailureRate {Machine=API open / a001OMB} 94

Go to rule

dotnet-eng-status · 2022-11-14T04:11:22Z

💔 Metric state changed to alerting

Description and instructions for this alert

Please note that this alert will fire every 12 hours as the list of machines can change while the alert is alive. So please keep an eye on the list of machines in the comment.

FailureRate {Machine=API open / a001OKI} 92
FailureRate {Machine=API open / a001OKK} 81
FailureRate {Machine=API open / a001OL7} 88
FailureRate {Machine=API open / a001OLF} 100
FailureRate {Machine=API open / a001OLH} 100
FailureRate {Machine=API open / a001OLO} 100
FailureRate {Machine=API open / a001OLP} 90
FailureRate {Machine=API open / a001OLQ} 100
FailureRate {Machine=API open / a001OLU} 84
FailureRate {Machine=API open / a001OLV} 100
FailureRate {Machine=API open / a001OLZ} 92
FailureRate {Machine=API open / a001OM1} 100
FailureRate {Machine=API open / a001OMB} 94

Go to rule
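
Since the machine list changes between alert comments, it can help to diff two comments rather than compare them by eye. The following is a minimal sketch, not part of the alerting automation: the function names `parse_failure_rates` and `diff_alerts` are hypothetical, and the sample lines are a small subset taken from the first two alert comments in this issue.

```python
import re

# Matches lines such as: FailureRate {Machine=API open / a001OKI} 92
LINE_RE = re.compile(r"^FailureRate \{Machine=(?P<machine>[^}]+)\} (?P<rate>\d+)$")


def parse_failure_rates(comment_text):
    """Return {machine: failure_rate} for every FailureRate line in a comment."""
    rates = {}
    for line in comment_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            rates[match.group("machine")] = int(match.group("rate"))
    return rates


def diff_alerts(previous, current):
    """Report which machines appeared, cleared, or stayed between two comments."""
    return {
        "new": sorted(current.keys() - previous.keys()),
        "cleared": sorted(previous.keys() - current.keys()),
        "still_failing": sorted(current.keys() & previous.keys()),
    }


# Subset of the 2022-11-12T03:26:20Z comment
first = parse_failure_rates("""
FailureRate {Machine=API open / a001OKI} 88
FailureRate {Machine=API open / a001OKN} 83
FailureRate {Machine=API open / a001OM1} 100
""")

# Subset of the 2022-11-12T15:27:46Z comment
second = parse_failure_rates("""
FailureRate {Machine=API open / a001OKI} 92
FailureRate {Machine=API open / a001OM1} 100
FailureRate {Machine=API open / a001OMB} 94
""")

print(diff_alerts(first, second))
```

Keying on the full machine label (queue plus machine name) keeps the sketch independent of how the queue name is formatted.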

premun · 2022-11-14T09:37:56Z

There was a massive spike over the weekend. I will check whether it comes from a specific PR.

premun · 2022-11-14T10:00:12Z

These are the sources of the problematic work items

Source	JobName	Pipeline	Stage	count_
ci/public/dotnet/runtime/refs/heads/release/7.0	5ec835ed-bbe2-437a-8a00-07f9af9f9d45	runtime-extra-platforms	build_Android_x64_Release_AllSubsets_Mono_RuntimeTests_Interp	1904
ci/public/dotnet/runtime/refs/heads/release/7.0	15af78cc-43ec-4ff4-8c24-45e3d49f08f4	runtime-extra-platforms	build_Android_x64_Release_AllSubsets_Mono_RuntimeTests	1770
ci/public/dotnet/runtime/refs/heads/release/7.0	f45ab1d6-cd98-4738-836b-9af4778e2b4c	runtime-extra-platforms	build_Android_x64_Release_AllSubsets_Mono	451
ci/public/dotnet/runtime/refs/heads/release/7.0	39504e54-9ed6-433f-aedf-31999837ce70	runtime-extra-platforms	build_Android_arm64_Release_AllSubsets_Mono	442

The run: https://dev.azure.com/dnceng-public/public/_build/results?buildId=81161&view=results

The query:

// Count Android work items that exited with code 80 (APP_CRASH) in the last 3 days,
// grouped by the job that produced them.
Metrics
| where EventType == "_MobileDeviceOperation" and MetricName == "ExitCode" and Timestamp > now()-3d
| extend Dimensions = parse_json(Dimensions)
| extend
    ExitCode = MetricValue,
    Command = tostring(Dimensions.command),
    Platform = tostring(Dimensions.platform),
    Target = tostring(Dimensions.target)
| where Platform == "android" and ExitCode == 80 and Target != ""
// Attach job metadata so failures can be attributed to a branch, pipeline, and stage
| join kind=inner Jobs on JobId
| extend Props = parse_json(Properties)
| extend
    Stage = tostring(Props['System.PhaseName']),
    Pipeline = tostring(Props.DefinitionName)
| summarize count() by Source, JobName, Pipeline, Stage
| order by count_ desc
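
For readers less familiar with Kusto, the last two steps of the query (`summarize count() by ...` followed by `order by count_ desc`) amount to a group-and-count sorted by frequency. Here is a rough Python sketch of that step only. The rows are illustrative stand-ins for records that already passed the filters, the `JobName` values are hypothetical placeholders, and `summarize_counts` is not part of any telemetry tooling.

```python
from collections import Counter

SOURCE = "ci/public/dotnet/runtime/refs/heads/release/7.0"

# Illustrative rows only; JobName values are made-up placeholders.
rows = [
    {"Source": SOURCE, "JobName": "job-1", "Pipeline": "runtime-extra-platforms",
     "Stage": "build_Android_x64_Release_AllSubsets_Mono"},
    {"Source": SOURCE, "JobName": "job-2", "Pipeline": "runtime-extra-platforms",
     "Stage": "build_Android_arm64_Release_AllSubsets_Mono"},
    {"Source": SOURCE, "JobName": "job-1", "Pipeline": "runtime-extra-platforms",
     "Stage": "build_Android_x64_Release_AllSubsets_Mono"},
]


def summarize_counts(records, keys=("Source", "JobName", "Pipeline", "Stage")):
    """Group records by the given keys and return rows sorted by count, descending."""
    counts = Counter(tuple(record[key] for key in keys) for record in records)
    # most_common() yields (group, count) pairs ordered from most to least frequent
    return [dict(zip(keys, group), count_=n) for group, n in counts.most_common()]


for row in summarize_counts(rows):
    print(row["count_"], row["JobName"], row["Stage"])
```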

premun · 2022-11-14T10:01:56Z

Doesn't seem like an infra issue. The exit code is APP_CRASH and the logs hint the same:

[04:13:01] info: Running instrumentation class net.dot.MonoRunner took 20.2076418 seconds
[04:13:01] dbug: Exit code: 0
                 Std out:
                 INSTRUMENTATION_RESULT: shortMsg=Process crashed.
                 INSTRUMENTATION_CODE: 0
                 
                 
                 
[04:13:01] info: Short message:
                 Process crashed.
[04:13:01] fail: No value for 'return-code' provided in instrumentation result. This may indicate a crashed test (see log)
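
The log above shows the pattern to look for: the instrumentation result carries a `shortMsg` but no `return-code`. The sketch below applies that check to the captured standard output. It is a simplified illustration of the idea, not how XHarness itself is implemented, and both function names are hypothetical.

```python
RESULT_PREFIX = "INSTRUMENTATION_RESULT:"
CODE_PREFIX = "INSTRUMENTATION_CODE:"


def parse_instrumentation_output(std_out):
    """Collect INSTRUMENTATION_RESULT key/value pairs and the INSTRUMENTATION_CODE."""
    results = {}
    code = None
    for raw_line in std_out.splitlines():
        line = raw_line.strip()
        if line.startswith(RESULT_PREFIX):
            key, sep, value = line[len(RESULT_PREFIX):].strip().partition("=")
            if sep:  # ignore result lines that are not key=value
                results[key] = value
        elif line.startswith(CODE_PREFIX):
            code = int(line[len(CODE_PREFIX):])
    return results, code


def looks_like_app_crash(std_out):
    """Treat a missing 'return-code' entry as a sign that the app crashed."""
    results, _ = parse_instrumentation_output(std_out)
    return "return-code" not in results


# Standard output copied from the log excerpt in this comment
sample = """
INSTRUMENTATION_RESULT: shortMsg=Process crashed.
INSTRUMENTATION_CODE: 0
"""

results, code = parse_instrumentation_output(sample)
print(results, code, looks_like_app_crash(sample))
```

Note that `INSTRUMENTATION_CODE` is 0 here even though the process crashed, which is why the check keys on the missing `return-code` instead.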

premun · 2022-11-14T10:15:03Z

@akoeplinger I might have found something interesting to look into. There was a spike on the night from Friday to Saturday (Redmond evening), when the runtime-extra-platforms run for the release/7.0 branch failed quite spectacularly. We should watch the next build, but something may be seriously broken.

This alert normally never fires, and Android emulators essentially never have infra failures by the 3rd attempt. The line graph for this metric is not really a line graph, as it has no data points except for Saturday.

(also see screenshots above)

All work items end with APP_CRASH with logs as shown above.

akoeplinger · 2022-11-14T12:35:32Z

I checked a few of these failures and they all crash with this in the adb log:

11-11 18:46:17.959 19813 19830 D DOTNET  : ((null) warning) Process terminated.
11-11 18:46:17.959 19813 19830 D DOTNET  : ((null) warning) Encountered infinite recursion while looking up resource 'Arg_EntryPointNotFoundException' in System.Private.CoreLib. Verify the installation of .NET is complete and does not need repairing, and that the state of the process has not become corrupted.
11-11 18:46:17.959 19813 19830 F libc    : Fatal signal 6 (SIGABRT), code -1 (SI_QUEUE) in tid 19830 (.dot.MonoRunner), pid 19813 (ot.Common.Tests)
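
When checking many work items, the fatal-signal line can be pulled out of each adb log with a regular expression. This is a minimal sketch under the assumption that the line keeps the shape shown above; `FATAL_RE` and `find_fatal_signal` are hypothetical names, not part of any existing tool.

```python
import re

# Matches the libc line, for example:
#   Fatal signal 6 (SIGABRT), code -1 (SI_QUEUE) in tid 19830 (.dot.MonoRunner), pid 19813 (ot.Common.Tests)
FATAL_RE = re.compile(
    r"Fatal signal (?P<number>\d+) \((?P<name>SIG[A-Z0-9]+)\).*?"
    r"in tid (?P<tid>\d+) \((?P<thread>[^)]*)\), "
    r"pid (?P<pid>\d+) \((?P<process>[^)]*)\)"
)


def find_fatal_signal(adb_log):
    """Return details of the first fatal signal in an adb log, or None if absent."""
    for line in adb_log.splitlines():
        match = FATAL_RE.search(line)
        if match:
            return {
                "signal": match.group("name"),
                "number": int(match.group("number")),
                "tid": int(match.group("tid")),
                "pid": int(match.group("pid")),
                "process": match.group("process"),
            }
    return None


# Lines copied from the adb log excerpt in this comment
log = """
11-11 18:46:17.959 19813 19830 D DOTNET  : ((null) warning) Process terminated.
11-11 18:46:17.959 19813 19830 F libc    : Fatal signal 6 (SIGABRT), code -1 (SI_QUEUE) in tid 19830 (.dot.MonoRunner), pid 19813 (ot.Common.Tests)
"""

print(find_fatal_signal(log))
```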

akoeplinger · 2022-11-14T12:39:54Z

This is almost certainly caused by dotnet/runtime#78018.
We don't seem to see this issue in main, though, so it is probably specific to 7.0.

akoeplinger · 2022-11-14T12:47:10Z

@premun I've opened a PR to revert the problematic change, thanks for catching it!

tkapin · 2022-11-14T15:31:07Z

Thanks for investigating this @premun and taking action @akoeplinger. Also, happy to see our mobile telemetry helping to identify a product issue.

dotnet-eng-status bot added the Active Alert, Critical, Ops - First Responder, Grafana Alert, and Production labels Nov 12, 2022

premun self-assigned this Nov 14, 2022

premun closed this as completed Nov 14, 2022

premun mentioned this issue Nov 14, 2022

Production - [Alerting] Android emulator failure rate alert #11620

Closed

premun mentioned this issue Dec 5, 2022

Handle faulty TCP connection between XHarness and Apple devices #11700

Closed

5 tasks
