Skip to content

fix flaky test on mysqlshell backup engine #17037

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Dec 9, 2024

Conversation

rvrangel
Copy link
Contributor

@rvrangel rvrangel commented Oct 22, 2024

Description

This is an attempt to fix the flakiness of the test introduced in #17000

Related Issue(s)

Related to the fix for #17011

https://vitess.slack.com/archives/CFBGH8GUV/p1729457569165799?thread_ts=1729264090.367559&cid=CFBGH8GUV

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

Signed-off-by: Renan Rangel <rrangel@slack-corp.com>
Copy link
Contributor

vitess-bot bot commented Oct 22, 2024

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@vitess-bot vitess-bot bot added NeedsBackportReason If backport labels have been applied to a PR, a justification is required NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsWebsiteDocsUpdate What it says labels Oct 22, 2024
@github-actions github-actions bot added this to the v22.0.0 milestone Oct 22, 2024
Copy link

codecov bot commented Oct 22, 2024

Codecov Report

Attention: Patch coverage is 95.65217% with 1 line in your changes missing coverage. Please review.

Project coverage is 67.39%. Comparing base (1f23496) to head (c5cce7b).
Report is 32 commits behind head on main.

Files with missing lines Patch % Lines
go/vt/mysqlctl/mysqlshellbackupengine.go 95.65% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #17037      +/-   ##
==========================================
- Coverage   67.41%   67.39%   -0.02%     
==========================================
  Files        1574     1574              
  Lines      253202   253204       +2     
==========================================
- Hits       170686   170647      -39     
- Misses      82516    82557      +41     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

This PR is being marked as stale because it has been open for 30 days with no activity. To rectify, you may do any of the following:

  • Push additional commits to the associated branch.
  • Remove the stale label.
  • Add a comment indicating why it is not stale.

If no action is taken within 7 days, this PR will be closed.

@github-actions github-actions bot added Stale Marks PRs as stale after a period of inactivity, which are then closed after a grace period. and removed Stale Marks PRs as stale after a period of inactivity, which are then closed after a grace period. labels Nov 22, 2024
@frouioui
Copy link
Member

@rvrangel Hi! This test is flaky and impacting us in a bunch of places, do you have some idea on when you can fix it?

Signed-off-by: Renan Rangel <rrangel@slack-corp.com>
Signed-off-by: Renan Rangel <rrangel@slack-corp.com>
Copy link
Contributor Author

@rvrangel rvrangel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re-reading the exec docs, Cmd.Wait() closes all pipes as soon as the program exits, so there is no way to control and avoid eventual between bufio's Scan() running and the pipe being closed, making it not very suitable for this use case.

So I changed approaches to use Cmd.Stdout and Cmd.Stderr and control the closing manually, as well as giving the 3 goroutines additional time to finish reading before we close the pipes, this seems to have eliminated the issue.

// after that we can process if an error has happened or not.
err = cmd.Run()

time.Sleep(time.Millisecond * 100) // give the goroutines some time to read any remaining logs before closing pipes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can consistently run this from the main branch and hit that edge case:

$ go test -run=TestMySQLShellBackupEngine_ExecuteBackup_ReleaseLock -race -count=1000
--- FAIL: TestMySQLShellBackupEngine_ExecuteBackup_ReleaseLock (0.01s)
    --- FAIL: TestMySQLShellBackupEngine_ExecuteBackup_ReleaseLock/lock_released_if_we_see_the_mysqlsh_lock_being_acquired (0.00s)
        mysqlshellbackupengine_test.go:379:
            	Error Trace:	/home/rrangel/vitess_rvrangel/go/vt/mysqlctl/mysqlshellbackupengine_test.go:379
            	Error:      	"I1126 15:18:30.734269 mysqlshellbackupengine.go:117] Starting ExecuteBackup in test\nI1126 15:18:30.734448 mysqlshellbackupengine.go:150] acquiring a global read lock before fetching the executed GTID sets\nI1126 15:18:30.734470 mysqlshellbackupengine.go:169] running /tmp/TestMySQLShellBackupEngine_ExecuteBackup_ReleaseLock892506315/001/test.sh --defaults-file=/dev/null --js -h localhost -e util.dumpInstance(\"logical/tmp/TestMySQLShellBackupEngine_ExecuteBackup_ReleaseLocklock_released_if_we_see_the_mysqlsh_lock_being_acquired531729948/001\", {\"threads\": 4})\nW1126 15:18:30.737258 backup.go:524] error scanning lines from mysqlshell stdout: read |0: file already closed\nW1126 15:18:30.737293 backup.go:524] error scanning lines from mysqlshell stderr: read |0: file already closed\nE1126 15:18:30.737314 mysqlshellbackupengine.go:534] could not release global lock earlier\nI1126 15:18:30.737335 mysqlshellbackupengine.go:203] Writing backup MANIFEST\nI1126 15:18:30.737503 mysqlshellbackupengine.go:243] Backup completed\nI1126 15:18:30.737512 xtrabackupengine.go:164] Closing backup file MANIFEST\n" does not contain "global read lock released after"
            	Test:       	TestMySQLShellBackupEngine_ExecuteBackup_ReleaseLock/lock_released_if_we_see_the_mysqlsh_lock_being_acquired
            	Messages:   	failed to release the global lock after mysqlsh
FAIL
cleaning up "/tmp/50bd7cc81b5edaf5bdb887a7c4ab3ba5.dat"
exit status 1
FAIL	vitess.io/vitess/go/vt/mysqlctl	12.238s

with this PR's branch, I reduced the sleep time to 10ms to allow me to runs it thousands of times and haven't seen the issue once the Sleep has been added:

$ go test -run=TestMySQLShellBackupEngine_ExecuteBackup_ReleaseLock -race -count=10000
PASS
cleaning up "/tmp/50bd7cc81b5edaf5bdb887a7c4ab3ba5.dat"
ok  	vitess.io/vitess/go/vt/mysqlctl	448.584s

I have tested this with 10ms on thousands of tests and didn't see it fail, 100ms should be safer to avoid the flaky test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should use sleep, why wouldn't it work to wait for the group then? Then the goroutines are done and would have read it all?

In general, a sleep is kinda an anti pattern for concurrent work. If it's needed, there's usually a better way to solve the problem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the problem is that closing it too fast once the program finishes doesn't give enough time for bufio to Scan the last line, so the test doesn't see the lock being released. and if we close it immediately, then there is this race between Scan() trying to read it while the file is closed and it doesn't report the last line.

In the real world this should never really happen as the line we are looking for is printed in the beginning, but it shows up immediately on the mocked test. I can move the sleep to the inside the mocked program, which would avoid having this in the code

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going to merge this since it affects CI so significantly. I still think there's something here fundamentally broken and it needs some rework to remove the actual race, but not going to hold this back here.

Signed-off-by: Renan Rangel <rrangel@slack-corp.com>
@rvrangel rvrangel marked this pull request as ready for review November 27, 2024 03:59
@deepthi deepthi requested a review from frouioui November 28, 2024 00:33
Copy link
Member

@frouioui frouioui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me right now! Thanks for the fix. I am going to restart the CI workflows a bunch of times before merging this to make sure it passes consistently on CI.

@frouioui frouioui added Type: Testing Component: Backup and Restore Do Not Merge Flakes and removed NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says NeedsIssue A linked issue is missing for this Pull Request NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels Nov 28, 2024
@@ -359,7 +356,8 @@ func TestMySQLShellBackupEngine_ExecuteBackup_ReleaseLock(t *testing.T) {
}

// this simulates mysql shell completing without any issues.
generateTestFile(t, mysqlShellBackupBinaryName, fmt.Sprintf("#!/bin/bash\n>&2 echo %s", mysqlShellLockMessage))
generateTestFile(t, be.binaryName, fmt.Sprintf(
"#!/bin/bash\n>&2 echo %s; echo \"backup completed\"; sleep 0.01", mysqlShellLockMessage))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rvrangel This feels still like it works around the race? Is there a way to more fundamentally fix the race? There's afaik no guarantee that here that the actual implementation will take some time before it exits after the last line outputted and we'd lose that last line in logging etc?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not that I know, but let me know if you have other ideas we could try. the main problem seems to happen when the pipe/file is closed during the Scan() is taking place in the other goroutine and I haven't found a way of avoiding it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rvrangel Fwiw, I've tried running it without the sleep 0.01 in the test program and it doesn't seem to fail? So is that really needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you run the test as I did here (so it runs thousands of times with -race -count=1000)? If I remove the sleep I do see the errors pop up every now and then when I run it. If you don't, maybe try it with 10k instead.

@dbussink dbussink merged commit f746c48 into vitessio:main Dec 9, 2024
99 of 108 checks passed
@dbussink dbussink added the Backport to: release-21.0 Needs to be backport to release-21.0 label Dec 9, 2024
garfthoffman pushed a commit to github/vitess-gh that referenced this pull request Jan 27, 2025
Signed-off-by: Renan Rangel <rrangel@slack-corp.com>
Signed-off-by: garfthoffman <109185460+garfthoffman@users.noreply.github.com>
frouioui pushed a commit to planetscale/vitess that referenced this pull request Mar 18, 2025
Signed-off-by: Renan Rangel <rrangel@slack-corp.com>
frouioui pushed a commit to planetscale/vitess that referenced this pull request Mar 18, 2025
Signed-off-by: Renan Rangel <rrangel@slack-corp.com>
Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants