A fix proposal for a bug in raft/synchro #5853
-
In the fifth solution I would rephrase "wait until all replicas in _cluster tell their state" into "wait until local recovery limbo is resolved".
-
Here's an implementation plan I've put together after all the discussions we had. It mostly follows the ideas of variants III and IV.
3.1) (optional) A BECOME_LEADER() request pins the limbo to the instance that issued the request. This way, even if the limbo is empty, no one will be able to fill it without issuing another BECOME_LEADER request.
Open questions:
-
This is a list of ideas for a fix regarding issue #5445.
Here's a brief bug description, since the original issue doesn't have the necessary information:
Imagine you have N raft nodes, with one leader, denoted A.
Once A dies, other nodes are left with some of A's synchronous transactions, which are neither confirmed nor rolled back.
These transactions might be present on only some of the alive replicas, so the freshly elected leader, denoted B, issues two log entries: CONFIRM and ROLLBACK. These entries confirm everything that is present on B and roll back everything else, should there be such transactions.
Now imagine A returns after some time. First of all, it receives the CONFIRM and ROLLBACK entries from B and rolls back the transactions that aren't present on B. Later on A might become the leader again. This means replicas will start accepting replication from A, and the first thing A will send them is these unconfirmed transactions from its WAL.
Now we are in a situation where all the replicas that initially had all of A's transactions have rolled them back, which is correct, while all the replicas that had only part of A's transactions receive the missing ones later on and apply them, since the ROLLBACK covering them has already passed. So the cluster is in an inconsistent state.
Two more notes:
A list of possible solutions to this problem which we have thought of follows.
I. Up to VCLOCK_MAX limbos and no ROLLBACK entries.
The main idea is to never roll entries back. Even if an instance dies, its transactions remain open until it comes back to life and confirms them. Now, to allow everyone else to write while one of the instances has open transactions, we introduce 32 limbos, one per cluster member. This makes leader election obsolete: anyone can be the leader, and even synchronous multimaster becomes possible.
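Roughly, the data layout could look like this. This is just a minimal sketch; the struct and field names are made up for illustration and are not the real Tarantool types:

```c
#include <stdint.h>

enum { VCLOCK_MAX = 32 };

/* Illustrative per-member limbo: a queue of pending synchronous
 * transactions originating from one cluster member. */
struct limbo {
	uint32_t owner_id;  /* replica id whose transactions wait here */
	int64_t first_lsn;  /* oldest pending lsn, -1 when empty */
	int64_t last_lsn;   /* newest pending lsn, -1 when empty */
};

/* One limbo per possible replica id instead of a single shared one.
 * A dead member's limbo simply stays non-empty until it returns and
 * confirms its entries, while the other members keep writing into
 * their own limbos. */
struct limbo_set {
	struct limbo limbos[VCLOCK_MAX];
};

/* Each transaction goes to the limbo of its origin replica. */
static inline struct limbo *
limbo_of(struct limbo_set *set, uint32_t origin_id)
{
	return &set->limbos[origin_id % VCLOCK_MAX];
}
```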
UPD: having no ROLLBACK entries may lead to a conflict similar to the one we would have in async replication.
There's no way to roll back an instance's tx (without a rejoin), so if instance A writes something (without confirming), then dies, and instance B writes the same data while A is away, A won't be able to connect back, because it'll fail with duplicate data.
This wouldn't happen if ROLLBACKs were allowed: A would first receive a rollback for its data and only then receive the duplicate tx from B (which at that point wouldn't be a duplicate anymore).
Pros:
Cons:
II. Duplicate ROLLBACK entries.
The problem arises from the fact that the ROLLBACK and the actual entries that need to be rolled back switch places in the replication stream:
first goes the ROLLBACK entry, and only then come the transactions that were meant to be rolled back.
This particular solution tries to make things right by duplicating every ROLLBACK an instance receives.
So, when A receives a ROLLBACK authored by B for A's transactions, A duplicates this ROLLBACK entry and writes a ROLLBACK authored by A for A's transactions. Now even the replicas that received B's ROLLBACK first and A's transactions second will also receive a copy of that ROLLBACK after A's transactions.
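In code the rule could look roughly like this. The record layout and function names are invented for illustration; the real entries carry more fields:

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative ROLLBACK record: "roll back everything with lsn >= lsn
 * that originated from replica target_id", written by replica author_id. */
struct rollback_entry {
	uint32_t author_id;
	uint32_t target_id;
	int64_t lsn;
};

/* Stub standing in for a real WAL write. */
static void
wal_write_rollback(const struct rollback_entry *e)
{
	printf("ROLLBACK by %u for %u, lsn >= %lld\n",
	       (unsigned)e->author_id, (unsigned)e->target_id,
	       (long long)e->lsn);
}

/* Variant II rule: on receiving a ROLLBACK authored by someone else for
 * our own transactions, write a duplicate authored by us. Replicas that
 * saw the foreign ROLLBACK before our stale transactions will then also
 * see our copy after them, so the transactions get rolled back anyway. */
static void
on_rollback_received(uint32_t self_id, const struct rollback_entry *e)
{
	if (e->target_id == self_id && e->author_id != self_id) {
		struct rollback_entry dup = {
			.author_id = self_id,
			.target_id = self_id,
			.lsn = e->lsn,
		};
		wal_write_rollback(&dup);
	}
}
```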
Pros:
Cons:
Open questions:
We always discuss a situation where A is gone for some time. What if there's another replica, say C, which is also dead and has A's transactions? When it comes back, all the same problems we had with A will be present.
So it looks like ANY instance has to duplicate EVERY ROLLBACK it receives, except the ROLLBACKs that are authored by and refer to the same instance.
III. Limbo ownership judged by CONFIRM N / ROLLBACK N+1 pairs.
Every time the leader changes, the new leader issues a pair of records: CONFIRM N, ROLLBACK N+1.
We may introduce a concept of limbo ownership: once some instance has written such a CONFIRM + ROLLBACK pair,
it becomes the new limbo owner. Once the limbo owner is known, it becomes the only allowed replication source. Replicas turn everything they receive from other instances into NOPs.
How this is going to help: once A returns, its old synchronous transactions will come before its CONFIRM + ROLLBACK in the replication stream and will get ignored by replicas (turned into NOPs to propagate vclock).
This is somewhat close to the concept of terms in Raft, which is discussed below.
UPD:
The vclock is updated each time a CONFIRM/ROLLBACK pair is received. Say, the replica with id 3 issues CONFIRM/ROLLBACK for lsn N of the replica with id 2. Then the replica with id 4, upon receiving this pair, will update the vclock with {id = 2, max_valid_lsn = N}. Every row from replica 2 will then be ignored until 2 issues its own CONFIRM/ROLLBACK pair. Once it does so, 2 is removed from the vclock.
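A sketch of such a filter, with invented names, just to make the UPD above concrete. It assumes CONFIRM/ROLLBACK records themselves always reach the update function, even while ordinary rows from the filtered replica are being NOPed:

```c
#include <stdint.h>
#include <stdbool.h>

enum { VCLOCK_MAX = 32 };

/* Illustrative per-replica filter built from received CONFIRM/ROLLBACK
 * pairs: max_valid_lsn[i] is the last lsn of replica i confirmed by the
 * current limbo owner, or -1 when rows of replica i are not filtered. */
struct ownership_filter {
	int64_t max_valid_lsn[VCLOCK_MAX];
};

static void
ownership_filter_create(struct ownership_filter *f)
{
	for (int i = 0; i < VCLOCK_MAX; i++)
		f->max_valid_lsn[i] = -1;
}

/* A CONFIRM N / ROLLBACK N+1 pair written by author_id for rows of
 * target_id either starts filtering target_id's rows (foreign pair) or
 * stops filtering them (target_id reclaimed ownership with its own pair). */
static void
ownership_filter_update(struct ownership_filter *f, uint32_t author_id,
			uint32_t target_id, int64_t confirm_lsn)
{
	if (author_id == target_id)
		f->max_valid_lsn[target_id] = -1;
	else
		f->max_valid_lsn[target_id] = confirm_lsn;
}

/* True when a row must be turned into a NOP: it originates from a
 * replica that lost limbo ownership and lies past the confirmed range. */
static bool
ownership_filter_is_stale(const struct ownership_filter *f,
			  uint32_t origin_id, int64_t lsn)
{
	int64_t max_valid = f->max_valid_lsn[origin_id];
	return max_valid >= 0 && lsn > max_valid;
}
```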
Pros:
Cons:
IV. Move our implementation closer to the Raft paper.
This is close to the previous idea, with some additional details.
Let's make RAFT_TERM a global entry, so that it gets replicated. Now replicas will receive the term of each row together with the replication stream, and will filter rows by term.
We may also decide to exchange terms through normal log replication rather than the weird asynchronous Raft messages we implemented earlier. But this is optional.
Here's how it's going to help. Once A returns, it starts feeding replicas rows with an old term. Replicas ignore such rows upon receiving them. Once recovery comes to the point where the newest term got persisted, replicas stop ignoring A's rows.
In other words, we should filter rows not only by replication source == leader id, but also by row term >= current instance term.
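As a sketch, the applier-side check could be as simple as this; the row layout and names are illustrative only:

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative row header: in variant IV the Raft term becomes a
 * replicated per-row attribute next to the origin id and lsn. */
struct row_header {
	uint32_t replica_id;
	int64_t lsn;
	uint64_t term;
};

/* Applier-side filter from variant IV: accept a row only if it comes
 * from the current leader and was written in a term not older than the
 * instance's own. The rows A relays from its old term right after it
 * returns fail this check; once A's stream reaches the point where the
 * newest term was persisted, its rows pass again. */
static bool
applier_accept_row(const struct row_header *row, uint32_t leader_id,
		   uint64_t instance_term)
{
	return row->replica_id == leader_id && row->term >= instance_term;
}
```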
Pros:
Cons:
UPD: Speaking of sync/async txs, a row may carry a flag saying whether it belongs to a sync tx or not, so it's not hard to filter this in the applier.
V. Filter rows on master side and delay relaying until all replicas tell their state.
The idea is to wait until all replicas in _cluster tell their state before starting to relay anything to them. This means waiting until they subscribe to us, or we subscribe to them. This will let us know their vclocks.
If their vclock is identical to our vclock right after the end of local recovery, this means that no one has intervened in our limbo, and we may safely relay everything to them as is, gathering acks for the pending transactions (this is not done yet: the master doesn't gather acks after restart, so it's a separate ticket).
If their vclock is greater than ours, we should delay relaying until we receive CONFIRM N / ROLLBACK N+1 from any of them.
Once we receive it, we start relaying with two exceptions: we do not gather acks anymore, and upon relaying we replace rows in the range [N+1, M] with NOPs, where M is our lsn at the end of local recovery.
How this fixes the problem: imagine A comes back. The end of its WAL consists of unconfirmed transactions. A mustn't send them to others; otherwise we would end up in the situation described above many times, with the txs arriving after the ROLLBACK for them.
So A replaces them with NOPs. Vclocks on every node converge, and everyone has the same data.
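A sketch of the relay-side check, with invented names, assuming the master already knows N from the received CONFIRM and M from its own local recovery:

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative relay-side state for variant V: a restarted master
 * learned from a replica's CONFIRM N / ROLLBACK N+1 pair that
 * everything it wrote past N was rolled back elsewhere. */
struct relay_nop_filter {
	int64_t confirmed_lsn;    /* N taken from the received CONFIRM */
	int64_t recovery_end_lsn; /* M, our lsn at the end of local recovery */
};

/* True when a row from our own WAL must be relayed as a NOP: it is one
 * of the unconfirmed tail transactions in [N + 1, M]. Sending a NOP
 * still advances the replica's vclock but does not resurrect data that
 * was already rolled back. */
static bool
relay_row_as_nop(const struct relay_nop_filter *f, int64_t row_lsn)
{
	return row_lsn > f->confirmed_lsn && row_lsn <= f->recovery_end_lsn;
}
```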
Pros:
Cons: