Replies: 4 comments 1 reply
-
Love this! Will keep the feedback short because I'm sure we'll discuss in more depth soon.
-
Recap of call with Jason:
-
This is awesome, great work on this, @jaril! Nothing to add that wasn't already mentioned here and in today's meeting.
So exciting! It will be amazing to easily snag a reference to the last passing replay. I can't wait to have this set up for us 😍
-
Late to the party, but here are my thoughts. I think that the flow of:
Is an awesome start to the "Debugging" workflow. It provides the minimum data needed to start debugging why the failure occurred. I also like that we use the sidebar for the Test File view, because it allows the user to stay within the context of the run in the main view while digging in deeper.
This is interesting. Cypress defines flakiness very specifically. From their docs:
It sounds like we're defining flaky here as passing and failing over time on the same branch, regardless of other variables (app code changes, test code changes, environment, type of test run, etc.). That may actually align more closely with what devs experience, but we could end up identifying "false positives" of flake when there are underlying app or test code changes. Ultimately, I think an initial flakiness score for Test Suite (run pass/fail) and Test Case (test pass/fail) would be a good starting point.
Both the Flaky Score and Test Runs are in the "Analysis" workflow. I'm not sold on a Test Runs view until we have more value to add for that workflow in terms of highlighting more metadata and trends (agree that it's a duplication of CI/test runner views), and I'm not sure how we would handle navigation into this view. I'm inclined to make sure we can do the "Debugging" workflow well first and better understand what data would be helpful for an "Analysis" workflow before building it. However, if we want to scaffold the view with just fails/flake and then add more metadata later, that would make sense too if there's not too much overhead. Very excited to see a v1!!!
-
Motivation
When the user is trying to debug a failing test, we aid them with action-playwright. That action records the failing tests to minimize the time it takes to start debugging, and hopefully gives them a better debugging experience than they would otherwise have without Replay.
Each failing test corresponds to a replay, so we're providing the user with one replay's worth of data when they go through this flow. But we can do better than that.
A useful thing for the user in that situation is being able to compare the failing replay for that PR with an applicable passing replay. Debugging the two side by side reveals meaningful information that helps the user identify what unexpected behavior led to the test failure.
That's where test view support comes in. Currently, there's no easy way for users to find an applicable replay of a passing test and compare it to their replay of a failing test. If there were, users would get more value out of integrating Replay with their CI.
Addressed workflow
The general workflow here is that the user will do some action that triggers a test run. This can be initiated by any of the following:
Whenever a test run is kicked off, there's a possibility that the user will be informed that one or more of their tests have failed. When that happens, the user will want to understand why and be informed enough to decide on the next course of action (e.g. fix the bug, file an issue, ignore it).
Note that even before we make any changes, we already expose the link to the replay of the failing test, whether that's in the logs or, in the case of a PR, as a comment.
Instead, this work's direct impact will be on what data the user has at hand to start debugging that replay of the failing test. Specifically, exposing replays of a passing test to act as a reference while debugging.
Implementation
We're going to implement two things: a test run view and a test file view.
Test Run View
This is a simple view where we filter the workspace's recordings by the test run's ID. It will display the replays for all of the tests that ran.
The entry point for getting into this view is a link that's provided at the point of failure. For the current implementation, that means it will be left as a log in the GitHub action itself. Actions related to PRs will additionally leave a comment on the PR itself. Those are the two places where the user can grab a link that will show them the test run view.
This view will show the failing tests first, then the passing tests. For now, the point of this view is not to surface any insights or data. Instead, it's like a lobby where you decide what to investigate. In most cases, that means the user will see failing tests and click into them.
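To make the shape of this view concrete, here's a minimal TypeScript sketch of the filtering and ordering described above. The `RecordingSummary` shape, its `metadata.test` fields, and the failing-first ordering are assumptions for illustration only, not the actual Replay schema or implementation.

```ts
// Hypothetical shape of a recording as the Test Run view might consume it.
// Field names are assumptions, not the real Replay metadata schema.
interface RecordingSummary {
  id: string;
  title: string;
  metadata: {
    test?: {
      file: string;                      // e.g. "auth.spec.ts"
      result: "passed" | "failed";
      runId: string;                     // the test run that produced this recording
    };
  };
}

// Test Run view: keep only recordings from the given run, failing tests first.
function getTestRunView(recordings: RecordingSummary[], runId: string): RecordingSummary[] {
  const rank = (r: RecordingSummary) => (r.metadata.test?.result === "failed" ? 0 : 1);
  return recordings
    .filter((r) => r.metadata.test?.runId === runId)
    .sort((a, b) => rank(a) - rank(b));
}
```

The only real logic here is "filter by run ID, then bubble failures to the top", which matches the "lobby" framing: no insights, just a starting point for clicking into failures.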
Test File View
Once a user indicates that they'd like to learn more about a particular failing test, they can click on it and pop open the Test File view for that test. This shows additional information about the replay, displayed on the right-hand side.
In that right sidebar, we can surface relevant links for the user to resolve the failing test. There'll be a link to the replay of the failing test, as well as a link to a replay of a passing test that's applicable to this debugging scenario.
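One way to pin down what "applicable" could mean: the most recent passing replay of the same test file, preferring recordings that came from the main branch. The shapes, the `source.branch` field, and the preference for `main` below are all assumptions for illustration, not a settled definition.

```ts
// Hypothetical recording shape; field names are assumptions for this sketch.
interface TestRecording {
  id: string;
  createdAt: string; // ISO timestamp
  metadata: {
    test: { file: string; result: "passed" | "failed" };
    source?: { branch?: string };
  };
}

// One possible definition of an "applicable" passing replay: same test file,
// most recent first, preferring recordings produced on the main branch.
function findPassingReference(
  failing: TestRecording,
  candidates: TestRecording[]
): TestRecording | undefined {
  const passing = candidates
    .filter(
      (r) =>
        r.metadata.test.file === failing.metadata.test.file &&
        r.metadata.test.result === "passed"
    )
    .sort((a, b) => b.createdAt.localeCompare(a.createdAt));

  return passing.find((r) => r.metadata.source?.branch === "main") ?? passing[0];
}
```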
Reasoning
This is a total hot take and you're probably thinking, "Jaril that's bananas that's not what we talked about!". Hear me out:
Omitted features
Test Runs view
This seemed to be a given, since every test runner dashboard has a reverse-chronological list of test runs and when they happened. It made sense to me to add it to this proposal initially because of that. But after more thinking, I realized that it's a duplication that has limited usefulness.
If we did have a Test Runs view, it would be a 1:1 copy of what the user would see if they were to simply go to GitHub Actions and click on their test suite. We're not providing any additional value that would make them come to Replay first for that list, instead of GitHub Actions.
There's an argument for still pushing this in since we can display a prettier representation of their test runs as compared to GitHub. We could show the pass/fail numbers for each test. But even then, I don't think that's compelling enough to be prioritized, at least for this version.
There's also a separate, somewhat superficial but more convincing argument in support of a Test Runs view — demos, screenshots, and videos. Following a link from a PR that brings you into Replay is much less sexy than going to Replay first, clicking on a test runs view, picking a test run, and going from there. It's not at all a practical user flow, but I could see the argument for getting it in just so we could have better demoware.
Action view
Each GitHub Actions run has an associated commit SHA. Each SHA can have multiple test runs, for example if a user re-runs the tests because they're flaky. We could explore a view where we filter by SHA and show the replays of the tests that correspond to it. But I didn't see any immediate, compelling benefit from that view. Typically, I just care about the latest test run. We can re-evaluate whether this is important later on, but for now, I'm happy to forgo it.
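For what it's worth, "just the latest run per SHA" stays cheap to compute if we ever revisit this. A rough sketch, assuming runs carry a commit SHA and a start timestamp (hypothetical fields):

```ts
// Hypothetical test run shape; fields are assumptions for this sketch.
interface TestRun {
  runId: string;
  commitSha: string;
  startedAt: string; // ISO timestamp, so lexicographic comparison works
}

// Keep only the most recent run for each commit SHA.
function latestRunPerSha(runs: TestRun[]): Map<string, TestRun> {
  const latest = new Map<string, TestRun>();
  for (const run of runs) {
    const current = latest.get(run.commitSha);
    if (!current || run.startedAt > current.startedAt) {
      latest.set(run.commitSha, run);
    }
  }
  return latest;
}
```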
Test file versioning
This would be nice, but it's generally an edge case and unimportant compared to getting the rest of this feature right in the first place. It also introduces some more complexity that I'd like to sidestep if possible.
Flakiness Score
This is very important information that's relevant to the user as they're looking at the test run view and figuring out whether they should take a failing test seriously (or not).
In general, it feels like you could get a rough idea of flakiness by taking the last ~100 test runs for that file on the main branch. But that's a loaded statement — it's possible that the user has only enabled Playwright tests for pushes to a PR, in which case we don't have that data. And even if we did, there's a built-in assumption that every test run on the main branch should be passing, which might be untrue, since it's not unheard of to push code with failing tests to main.
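To make that heuristic concrete, here's a rough sketch of the calculation: the failure rate over the last ~100 main-branch results for a test file. The shapes and field names are hypothetical, and the caveats above (main-branch coverage, failing pushes to main) still apply.

```ts
// Hypothetical per-test result record; fields are assumptions for this sketch.
interface TestResult {
  file: string;
  branch: string;
  passed: boolean;
  finishedAt: string; // ISO timestamp
}

// Naive flakiness estimate: failure rate over the last `window` results
// for this test file on the main branch. Assumes main "should" be green,
// which, as noted above, isn't always true.
function flakinessScore(results: TestResult[], file: string, window = 100): number | null {
  const recent = results
    .filter((r) => r.file === file && r.branch === "main")
    .sort((a, b) => b.finishedAt.localeCompare(a.finishedAt))
    .slice(0, window);

  if (recent.length === 0) {
    return null; // no main-branch data, e.g. tests only run on PR pushes
  }

  const failures = recent.filter((r) => !r.passed).length;
  return failures / recent.length;
}
```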
In any case, I could be persuaded that this is actually easier than I'm making it out to be and that we should include it in V1. But I'm erring on the side of caution to keep the core part of this out.
V2 considerations:
Additional notes
How action-playwright works