fix flaky test on mysqlshell backup engine #17037
Conversation
Signed-off-by: Renan Rangel <rrangel@slack-corp.com>
Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

- General
- Tests
- Documentation
- New flags
- If a workflow is added or modified
- Backward compatibility
Codecov Report

Attention: Patch coverage is

Additional details and impacted files:

```
@@           Coverage Diff            @@
##             main   #17037    +/-  ##
==========================================
- Coverage   67.41%   67.39%   -0.02%
==========================================
  Files        1574     1574
  Lines      253202   253204       +2
==========================================
- Hits       170686   170647      -39
- Misses      82516    82557      +41
```

☔ View full report in Codecov by Sentry.
This PR is being marked as stale because it has been open for 30 days with no activity. To rectify, you may do any of the following:
If no action is taken within 7 days, this PR will be closed.
@rvrangel Hi! This test is flaky and impacting us in a bunch of places, do you have some idea of when you can fix it?
Re-reading the exec docs: Cmd.Wait() closes all pipes as soon as the program exits, so there is no way to control that and avoid an eventual race between bufio's Scan() running and the pipe being closed, which makes it not very suitable for this use case.

So I changed approaches to use Cmd.Stdout and Cmd.Stderr and control the closing manually, as well as giving the 3 goroutines additional time to finish reading before we close the pipes. This seems to have eliminated the issue.
```go
// after that we can process if an error has happened or not.
err = cmd.Run()

time.Sleep(time.Millisecond * 100) // give the goroutines some time to read any remaining logs before closing pipes.
```
I can consistently run this from the `main` branch and hit that edge case:

```
$ go test -run=TestMySQLShellBackupEngine_ExecuteBackup_ReleaseLock -race -count=1000
--- FAIL: TestMySQLShellBackupEngine_ExecuteBackup_ReleaseLock (0.01s)
    --- FAIL: TestMySQLShellBackupEngine_ExecuteBackup_ReleaseLock/lock_released_if_we_see_the_mysqlsh_lock_being_acquired (0.00s)
        mysqlshellbackupengine_test.go:379:
            Error Trace: /home/rrangel/vitess_rvrangel/go/vt/mysqlctl/mysqlshellbackupengine_test.go:379
            Error: "I1126 15:18:30.734269 mysqlshellbackupengine.go:117] Starting ExecuteBackup in test\nI1126 15:18:30.734448 mysqlshellbackupengine.go:150] acquiring a global read lock before fetching the executed GTID sets\nI1126 15:18:30.734470 mysqlshellbackupengine.go:169] running /tmp/TestMySQLShellBackupEngine_ExecuteBackup_ReleaseLock892506315/001/test.sh --defaults-file=/dev/null --js -h localhost -e util.dumpInstance(\"logical/tmp/TestMySQLShellBackupEngine_ExecuteBackup_ReleaseLocklock_released_if_we_see_the_mysqlsh_lock_being_acquired531729948/001\", {\"threads\": 4})\nW1126 15:18:30.737258 backup.go:524] error scanning lines from mysqlshell stdout: read |0: file already closed\nW1126 15:18:30.737293 backup.go:524] error scanning lines from mysqlshell stderr: read |0: file already closed\nE1126 15:18:30.737314 mysqlshellbackupengine.go:534] could not release global lock earlier\nI1126 15:18:30.737335 mysqlshellbackupengine.go:203] Writing backup MANIFEST\nI1126 15:18:30.737503 mysqlshellbackupengine.go:243] Backup completed\nI1126 15:18:30.737512 xtrabackupengine.go:164] Closing backup file MANIFEST\n" does not contain "global read lock released after"
            Test: TestMySQLShellBackupEngine_ExecuteBackup_ReleaseLock/lock_released_if_we_see_the_mysqlsh_lock_being_acquired
            Messages: failed to release the global lock after mysqlsh
FAIL
cleaning up "/tmp/50bd7cc81b5edaf5bdb887a7c4ab3ba5.dat"
exit status 1
FAIL    vitess.io/vitess/go/vt/mysqlctl 12.238s
```
With this PR's branch, I reduced the sleep time to 10ms to allow me to run it thousands of times, and I haven't seen the issue once the Sleep was added:

```
$ go test -run=TestMySQLShellBackupEngine_ExecuteBackup_ReleaseLock -race -count=10000
PASS
cleaning up "/tmp/50bd7cc81b5edaf5bdb887a7c4ab3ba5.dat"
ok      vitess.io/vitess/go/vt/mysqlctl 448.584s
```
I have tested this with 10ms over thousands of runs and didn't see it fail; 100ms should be safer still to avoid the flaky test.
I don't think we should use sleep. Why wouldn't it work to wait for the group instead? Then the goroutines are done and would have read it all.
In general, a sleep is kind of an anti-pattern for concurrent work. If it's needed, there's usually a better way to solve the problem.
The problem is that closing it too fast once the program finishes doesn't give bufio enough time to Scan the last line, so the test doesn't see the lock being released. And if we close it immediately, there is a race between Scan() trying to read while the file is being closed, and it doesn't report the last line.

In the real world this should never really happen, as the line we are looking for is printed at the beginning, but it shows up immediately in the mocked test. I can move the sleep inside the mocked program, which would avoid having this in the code.
Going to merge this since it affects CI so significantly. I still think there's something fundamentally broken here and it needs some rework to remove the actual race, but I'm not going to hold this back.
Looks good to me right now! Thanks for the fix. I am going to restart the CI workflows a bunch of times before merging this to make sure it passes consistently on CI.
```go
@@ -359,7 +356,8 @@ func TestMySQLShellBackupEngine_ExecuteBackup_ReleaseLock(t *testing.T) {
	}

	// this simulates mysql shell completing without any issues.
	generateTestFile(t, mysqlShellBackupBinaryName, fmt.Sprintf("#!/bin/bash\n>&2 echo %s", mysqlShellLockMessage))
	generateTestFile(t, be.binaryName, fmt.Sprintf(
		"#!/bin/bash\n>&2 echo %s; echo \"backup completed\"; sleep 0.01", mysqlShellLockMessage))
```
@rvrangel This still feels like it works around the race? Is there a way to fix the race more fundamentally? There's AFAIK no guarantee that the actual implementation will take some time to exit after the last line is output, and we'd lose that last line in logging etc.
Not that I know of, but let me know if you have other ideas we could try. The main problem seems to happen when the pipe/file is closed while Scan() is taking place in the other goroutine, and I haven't found a way of avoiding it.
@rvrangel FWIW, I've tried running it without the `sleep 0.01` in the test program and it doesn't seem to fail? So is that really needed?
Did you run the test as I did here, so it runs thousands of times with `-race -count=1000`? If I remove the sleep, I do see the errors pop up every now and then when I run it. If you don't, maybe try it with 10k instead.
Signed-off-by: Renan Rangel <rrangel@slack-corp.com> Signed-off-by: garfthoffman <109185460+garfthoffman@users.noreply.github.com>
Signed-off-by: Renan Rangel <rrangel@slack-corp.com> Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>
Description
This is an attempt to fix the flakiness of the test introduced in #17000
Related Issue(s)
Related to the fix for #17011
https://vitess.slack.com/archives/CFBGH8GUV/p1729457569165799?thread_ts=1729264090.367559&cid=CFBGH8GUV
Checklist
Deployment Notes