A fix proposal for a bug in raft/synchro #5853
-
In the fifth solution I would rephrase "wait until all replicas in _cluster tell their state" into "wait until local recovery limbo is resolved".
-
Here's an implementation plan I've put together after all the discussions we had. It mostly follows the ideas of variants III and IV.
3.1) (optional) A BECOME_LEADER() request pins the limbo to the instance that issued the request. This way, even if the limbo is empty, no one will be able to fill it without issuing another BECOME_LEADER request.
Open questions:
-
This is a list of ideas for a fix regarding issue #5445.
Here's a brief bug description, since the original issue doesn't have the necessary information:
Imagine you have N raft nodes, with one leader, denoted A.
Once A dies, other nodes are left with some of A's synchronous transactions, which are neither confirmed nor rolled back.
These transactions might be present on only some of the alive replicas, so the freshly elected leader, denoted B, issues two log entries: CONFIRM and ROLLBACK. These entries confirm everything that is present on B and roll back everything else, should there be such transactions.
Now imagine A returns after some time. First of all, it receives the CONFIRM and ROLLBACK entries from B and rolls back the transactions that aren't present on B. Later on A might become the leader again. This means replicas will start accepting replication from A, and the first thing A will send them is these unconfirmed transactions from its WAL.
Now we are in a situation where all the replicas that initially had all of A's transactions have rolled them back, which is correct, while all the replicas that had only part of A's transactions receive the missing ones later on and apply them, since the ROLLBACK covering them has already passed. So the cluster is in an inconsistent state.
Two more notes:
A list of possible solutions to this problem which we have thought of follows.
I. Up to VCLOCK_MAX limbos and no ROLLBACK entries.
The main idea is to never roll entries back. Even if an instance dies, its transactions remain open until it comes back to life and confirms them. Now, to allow everyone else to write while one of the instances has open transactions, we introduce 32 limbos, one per cluster member. This makes leader election obsolete: anyone can be the leader, and even synchronous multimaster becomes possible.
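Roughly, the data layout could look like this. This is just a minimal sketch; the struct and field names are made up for illustration and are not the real Tarantool types:

```c
#include <stdint.h>

enum { VCLOCK_MAX = 32 };

/* Illustrative per-member limbo: a queue of pending synchronous
 * transactions originating from one cluster member. */
struct limbo {
	uint32_t owner_id;  /* replica id whose transactions wait here */
	int64_t first_lsn;  /* oldest pending lsn, -1 when empty */
	int64_t last_lsn;   /* newest pending lsn, -1 when empty */
};

/* One limbo per possible replica id instead of a single shared one.
 * A dead member's limbo simply stays non-empty until it returns and
 * confirms its entries, while the other members keep writing into
 * their own limbos. */
struct limbo_set {
	struct limbo limbos[VCLOCK_MAX];
};

/* Each transaction goes to the limbo of its origin replica. */
static inline struct limbo *
limbo_of(struct limbo_set *set, uint32_t origin_id)
{
	return &set->limbos[origin_id % VCLOCK_MAX];
}
```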
UPD: having no ROLLBACK entries may lead to a conflict similar to the one we would have in async replication.
There's no way to roll back an instance's tx (without a rejoin), so if instance A writes something (without confirming), then dies, and instance B writes the same data while A is away, A won't be able to connect back, because it'll fail with duplicate data.
This wouldn't happen if ROLLBACKs were allowed: A would first receive a rollback for its data and only then receive the duplicate tx from B (which at that point wouldn't be a duplicate anymore).
Pros:
Cons:
II. Duplicate ROLLBACK entries.
The problem arises from the fact that the ROLLBACK and the actual entries that need to be rolled back switch places in the replication stream:
first goes the ROLLBACK entry, and only then come the transactions that were meant to be rolled back.
This particular solution tries to make things right by duplicating every ROLLBACK an instance receives.
So, when A receives a ROLLBACK authored by B for A's transactions, A duplicates this ROLLBACK entry and writes a ROLLBACK authored by A for A's transactions. Now even the replicas that received B's ROLLBACK first and A's transactions second will also receive a copy of that ROLLBACK after A's transactions.
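In code the rule could look roughly like this. The record layout and function names are invented for illustration; the real entries carry more fields:

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative ROLLBACK record: "roll back everything with lsn >= lsn
 * that originated from replica target_id", written by replica author_id. */
struct rollback_entry {
	uint32_t author_id;
	uint32_t target_id;
	int64_t lsn;
};

/* Stub standing in for a real WAL write. */
static void
wal_write_rollback(const struct rollback_entry *e)
{
	printf("ROLLBACK by %u for %u, lsn >= %lld\n",
	       (unsigned)e->author_id, (unsigned)e->target_id,
	       (long long)e->lsn);
}

/* Variant II rule: on receiving a ROLLBACK authored by someone else for
 * our own transactions, write a duplicate authored by us. Replicas that
 * saw the foreign ROLLBACK before our stale transactions will then also
 * see our copy after them, so the transactions get rolled back anyway. */
static void
on_rollback_received(uint32_t self_id, const struct rollback_entry *e)
{
	if (e->target_id == self_id && e->author_id != self_id) {
		struct rollback_entry dup = {
			.author_id = self_id,
			.target_id = self_id,
			.lsn = e->lsn,
		};
		wal_write_rollback(&dup);
	}
}
```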
Pros:
Cons:
Open questions:
We always discuss a situation where A is gone for some time. What if there's another replica, say C, which is also dead and has A's transactions? When it comes back, all the same problems we had with A will be present.
So it looks like ANY instance has to duplicate EVERY ROLLBACK it receives, except the ROLLBACKs that are authored by and refer to the same instance.
III. Limbo ownership judged by CONFIRM N / ROLLBACK N+1 pairs.
Every time the leader changes, the new leader issues a pair of records: CONFIRM N, ROLLBACK N+1.
We may introduce a concept of limbo ownership: once some instance has written such a CONFIRM + ROLLBACK pair,
it becomes the new limbo owner. Once the limbo owner is known, it becomes the only allowed replication source. Replicas turn everything they receive from other instances into NOPs.
How this is going to help: once A returns, its old synchronous transactions will come before its CONFIRM + ROLLBACK in the replication stream and will get ignored by replicas (turned into NOPs to propagate vclock).
This is somewhat close to the concept of terms in Raft, which is discussed below.
UPD:
The vclock is updated each time a CONFIRM/ROLLBACK pair is received. Say, the replica with id 3 issues CONFIRM/ROLLBACK for lsn N of the replica with id 2. Then the replica with id 4, upon receiving this pair, will update the vclock with {id = 2, max_valid_lsn = N}. Every row from replica 2 will then be ignored until 2 issues its own CONFIRM/ROLLBACK pair. Once it does so, 2 is removed from the vclock.
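A sketch of such a filter, with invented names, just to make the UPD above concrete. It assumes CONFIRM/ROLLBACK records themselves always reach the update function, even while ordinary rows from the filtered replica are being NOPed:

```c
#include <stdint.h>
#include <stdbool.h>

enum { VCLOCK_MAX = 32 };

/* Illustrative per-replica filter built from received CONFIRM/ROLLBACK
 * pairs: max_valid_lsn[i] is the last lsn of replica i confirmed by the
 * current limbo owner, or -1 when rows of replica i are not filtered. */
struct ownership_filter {
	int64_t max_valid_lsn[VCLOCK_MAX];
};

static void
ownership_filter_create(struct ownership_filter *f)
{
	for (int i = 0; i < VCLOCK_MAX; i++)
		f->max_valid_lsn[i] = -1;
}

/* A CONFIRM N / ROLLBACK N+1 pair written by author_id for rows of
 * target_id either starts filtering target_id's rows (foreign pair) or
 * stops filtering them (target_id reclaimed ownership with its own pair). */
static void
ownership_filter_update(struct ownership_filter *f, uint32_t author_id,
			uint32_t target_id, int64_t confirm_lsn)
{
	if (author_id == target_id)
		f->max_valid_lsn[target_id] = -1;
	else
		f->max_valid_lsn[target_id] = confirm_lsn;
}

/* True when a row must be turned into a NOP: it originates from a
 * replica that lost limbo ownership and lies past the confirmed range. */
static bool
ownership_filter_is_stale(const struct ownership_filter *f,
			  uint32_t origin_id, int64_t lsn)
{
	int64_t max_valid = f->max_valid_lsn[origin_id];
	return max_valid >= 0 && lsn > max_valid;
}
```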
Pros:
Cons:
IV. Move our implementation closer to the Raft paper.
This is close to the previous idea, with some additional details.
Let's make RAFT_TERM a global entry, so that it gets replicated. Now replicas will receive the term of each row together with the replication stream, and will filter rows by term.
We may also decide to exchange terms through normal log replication rather than the weird asynchronous Raft messages we implemented earlier. But this is optional.
Here's how it's going to help. Once A returns, it starts feeding replicas rows with an old term. Replicas ignore such rows upon receiving them. Once recovery comes to the point where the newest term got persisted, replicas stop ignoring A's rows.
In other words, we should filter rows not only by replication source == leader id, but also by row term >= current instance term.
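As a sketch, the applier-side check could be as simple as this; the row layout and names are illustrative only:

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative row header: in variant IV the Raft term becomes a
 * replicated per-row attribute next to the origin id and lsn. */
struct row_header {
	uint32_t replica_id;
	int64_t lsn;
	uint64_t term;
};

/* Applier-side filter from variant IV: accept a row only if it comes
 * from the current leader and was written in a term not older than the
 * instance's own. The rows A relays from its old term right after it
 * returns fail this check; once A's stream reaches the point where the
 * newest term was persisted, its rows pass again. */
static bool
applier_accept_row(const struct row_header *row, uint32_t leader_id,
		   uint64_t instance_term)
{
	return row->replica_id == leader_id && row->term >= instance_term;
}
```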
Pros:
Cons:
UPD: Speaking of sync/async txs, a row may carry a flag saying whether it belongs to a sync tx or not, so it's not hard to filter this in the applier.
V. Filter rows on master side and delay relaying until all replicas tell their state.
The idea is to wait until all replicas in _cluster tell their state before starting to relay anything to them. This means waiting until they subscribe to us, or we subscribe to them. This will let us know their vclocks.
If their vclock is identical to our vclock right after the end of local recovery, this means that no one has intervened in our limbo, and we may safely relay everything to them as is, gathering acks for the pending transactions (this is not done yet: the master doesn't gather acks after restart, so it's a separate ticket).
If their vclock is greater than ours, we should delay relaying until we receive CONFIRM N / ROLLBACK N+1 from any of them.
Once we receive it, we start relaying with two exceptions: we do not gather acks anymore, and upon relaying we replace rows in the range [N+1, M] with NOPs, where M is our lsn at the end of local recovery.
How this fixes the problem: imagine A comes back. The end of its WAL consists of unconfirmed transactions. A mustn't send them to others; otherwise we would end up in the situation described above many times, with the txs arriving after the ROLLBACK for them.
So A replaces them with NOPs. Vclocks on every node converge, and everyone has the same data.
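A sketch of the relay-side check, with invented names, assuming the master already knows N from the received CONFIRM and M from its own local recovery:

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative relay-side state for variant V: a restarted master
 * learned from a replica's CONFIRM N / ROLLBACK N+1 pair that
 * everything it wrote past N was rolled back elsewhere. */
struct relay_nop_filter {
	int64_t confirmed_lsn;    /* N taken from the received CONFIRM */
	int64_t recovery_end_lsn; /* M, our lsn at the end of local recovery */
};

/* True when a row from our own WAL must be relayed as a NOP: it is one
 * of the unconfirmed tail transactions in [N + 1, M]. Sending a NOP
 * still advances the replica's vclock but does not resurrect data that
 * was already rolled back. */
static bool
relay_row_as_nop(const struct relay_nop_filter *f, int64_t row_lsn)
{
	return row_lsn > f->confirmed_lsn && row_lsn <= f->recovery_end_lsn;
}
```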
Pros:
Cons: