-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add semi-sync monitor to unblock primaries blocked on semi-sync ACKs #17763
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Manan Gupta <manan@planetscale.com>
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
|
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #17763 +/- ##
==========================================
+ Coverage 67.41% 67.48% +0.06%
==========================================
Files 1592 1593 +1
Lines 258024 258473 +449
==========================================
+ Hits 173948 174424 +476
+ Misses 84076 84049 -27 ☔ View full report in Codecov by Sentry. |
Signed-off-by: Manan Gupta <manan@planetscale.com>
func (w *Watcher) stillBlocked() bool { | ||
w.mu.Lock() | ||
defer w.mu.Unlock() | ||
return w.isOpen && w.isBlocked | ||
} | ||
|
||
// checkAndSetIsWriting checks if the watcher is already writing to the DB. | ||
// If it is not, then it sets the isWriting field and signals the caller. | ||
func (w *Watcher) checkAndSetIsWriting() bool { | ||
w.mu.Lock() | ||
defer w.mu.Unlock() | ||
if w.isWriting { | ||
return false | ||
} | ||
w.isWriting = true | ||
return true | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this affects the query hot path, right? If it does, then it might be worth e.g. using 1 byte for the status and using bits in there for isWriting, isBlocked, isOpen etc. so that we can use atomics for reading them, CAS for optional changes, etc. If nothing else, it's probably worth moving these to atomic.Bool so that e.g. checkAndSetIsWriting can be one atomic call:
return w.isWriting(false, true)
It makes the code simpler, clearer, and less contentious / efficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think performance is too much of a concern, but the usage of having multiple bool fields behind a mutex vs atomic.Bool I think becomes a matter of preference. I for one, like to have the former because that means that only one boolean value transitions at a point in time, but with atomic bool values it can change even when you've just read that value.
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
…writes Signed-off-by: Manan Gupta <manan@planetscale.com>
…c is blocked on the primary Signed-off-by: Manan Gupta <manan@planetscale.com>
…d for too long and a add a few logs Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
…ger to only start it when required Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
…ake tablet Signed-off-by: Manan Gupta <manan@planetscale.com>
… is blocked Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking good! Just had some questions/comments. Let me know what you think.
// be called multiple times in parallel. | ||
func (m *Monitor) checkAndFixSemiSyncBlocked() { | ||
// Check if semi-sync is blocked or not | ||
ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we hardcoding the timeout and not using RemoteOperationTimeout
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use RemoteOperationTimeout here? Its not a remote operation. Its just talking to MySQL. I initially had no timeout whatsoever. Matt suggested I add one, so I added an abritrary one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RemoteOperationTimeout is the standard timeout for all RPC calls. This is an RPC from vttablet to mysqld.
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! I only had one minor suggestion. Nice work on this, @GuptaManan100 !
Signed-off-by: Manan Gupta <manan@planetscale.com>
Done, I've made that change too! 💕 |
Description
This PR introduces a new component to the vttablet binary to monitor the semi-sync status of primary vttablets. We've observed cases where a brief network disruption can cause the primary to get stuck indefinitely waiting for semi-sync ACKs. In rare scenarios, this can block reparent operations and render the primary unresponsive. More information can be found in the issues #17709 and #17749.
To address this, the new component continuously monitors the semi-sync status. If the primary becomes stuck on semi-sync ACKs, it generates writes to unblock it. If this fails, VTOrc is notified of the issue and initiates an emergency reparent operation.
A metric for the number of outstanding writes from the semi-sync monitor has also been added. Unfortunately, it's not easy to reproduce the problem in an end-to-end test since it requires setting port forward rules (on Mac) or iptable changes (on Linux), both of which require sudo access. So I've added a test but that test doesn't run on CI. It can only be run locally and the password for the root user has to entered when prompted for it.
The semi-sync monitor is used such that the users who aren't running semi-sync, will not see the monitor open at all. It will only run when semi-sync on the primary is turned on.
Related Issue(s)
Checklist
Deployment Notes