
Fix persistence and add automatic journal flushing #98

Open
wants to merge 9 commits into bakwc:master from JustAnotherArchivist:persistence

Conversation

JustAnotherArchivist
Contributor

This fixes #84.

A new configuration option, flushJournal, controls whether the journal should be flushed to disk on every write. This is enabled by default when using a file-based journal (i.e. when journalFile is not None). It is an error to specify flushJournal = True for the memory-based journal.
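For illustration, a minimal usage sketch (flushJournal is the option described above; the journal path and node addresses are placeholders):

from pysyncobj import SyncObj, SyncObjConf

# File-based journal with flushing enabled (the default when journalFile is set).
conf = SyncObjConf(journalFile='/var/lib/myapp/raft.journal', flushJournal=True)
counter = SyncObj('127.0.0.1:4321', ['127.0.0.1:4322', '127.0.0.1:4323'], conf=conf)
# flushJournal=True together with the in-memory journal (journalFile=None) is an error.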

To store the currentTerm and the votedForNodeIdHash in the journal, the header had to be expanded, which also means that the journal file format is now version 2. Migration is performed upon restarting with the upgraded code (and there's also a test for this). Regarding details about votedForNodeIdHash (and why this is an MD5 hash), see my recent comment in #84. The FileJournal now also verifies that the journal file version is supported by the code and throws a RuntimeError if that is not the case; previously, it simply always tried to read it as a version 1 file.
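Roughly, the version check amounts to something like this (a sketch only; the actual header layout and names in journal.py differ from this simplified illustration):

import struct

SUPPORTED_JOURNAL_VERSIONS = (1, 2)  # version 1 files are migrated to version 2 on load

def checkJournalVersion(headerBytes):
    # Assumed layout: the format version as a little-endian 32-bit unsigned int
    # at the start of the header.
    version, = struct.unpack('<I', headerBytes[:4])
    if version not in SUPPORTED_JOURNAL_VERSIONS:
        raise RuntimeError('Unsupported journal file version: %d' % version)
    return version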

If journal flushing is disabled, the node will not take part in elections until 1.1 times the maximum timeout has elapsed. This is necessary to prevent a node from voting twice in the same election, which could lead to two leaders being elected within a single term. It can, however, lead to a longer delay before the cluster is operational again after a failover.

This furthermore fixes a crash in the ResizableFile when performing a big write (compared to the current file size). See the commit message of 38d6adf for details.

…e when writing large data

ResizableFile only extended the file once, even when the data to be written required more than a single extension. For example, if the file is currently 1024 B and we're trying to write 1.5 KiB to it, it would get resized to 2048 B and the write would fail (because 2.5 KiB would be required to hold the existing data plus the new write).
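A minimal sketch of the kind of fix described (illustrative only; the names and the extension step are assumptions, not the actual ResizableFile code):

def writeWithResize(f, offset, data, currentSize, extendStep=1024):
    # Extend the backing file in a loop until the pending write fits,
    # instead of extending exactly once. E.g. a 1024 B file receiving a
    # 1.5 KiB write at offset 1024 must grow to 2560 B, not just 2048 B.
    requiredSize = offset + len(data)
    while currentSize < requiredSize:
        currentSize += extendStep
    f.truncate(currentSize)  # grow the file to the new size
    f.seek(offset)
    f.write(data)
    return currentSize
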
The FileJournal only stores the hash of the votedForNodeId because the size of the stored value has to be constant.
Therefore, FileJournal.votedForNodeId does not return the actual node ID but a proxy object which can be compared against the node ID.
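A sketch of that proxy idea (assumed names; the class in the PR may differ):

import hashlib

class VotedForNodeIdHashProxy(object):
    # Wraps the MD5 digest stored in the journal and compares equal to the
    # node ID it was derived from (or to another proxy with the same digest).
    def __init__(self, digest):
        self._digest = digest
    def __eq__(self, other):
        if isinstance(other, VotedForNodeIdHashProxy):
            return self._digest == other._digest
        return self._digest == hashlib.md5(str(other).encode('utf-8')).digest()
    def __ne__(self, other):  # needed on Python 2
        return not self.__eq__(other)
    def __hash__(self):
        return hash(self._digest)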
@JustAnotherArchivist
Contributor Author

Interesting failures. The test_sync_noflush_novote one seems timing-related, so I guess I'll need to increase the timeouts to make it less sensitive. And apparently math.ceil returns a float on Python 2, because of course Python 2 would do annoying things like that. Will fix shortly.

- The test was very sensitive to small timing differences.
  This updated version is still sensitive, but it should be much better now. I was unable to trigger any failures even under high load.
- o1 should be a follower after o2's first vote request due to the higher term in o2's message and its timeout not having been triggered again.
- Python 2's math.ceil returns a float instead of an int for whatever reason.
@JustAnotherArchivist
Contributor Author

Alright, I blame the Travis environment. I'm unable to reproduce it on either 2.7 or 3.6 on my machine. Multiple runs always complete successfully. And everything passed on AppVeyor as well.

These failures don't mean that the code is wrong, by the way. Instead, it seems that the test suite's doTicks timeout is not very accurate: when I measured it earlier, it ticked for 3.19 seconds instead of the 3.05 specified through the three calls (the last doTicks call used a timeout of 0.5 s then). This is probably somewhat unavoidable, but it's problematic for a test like this one, which is extremely sensitive to timing (the second call currently has to stop between 2.5 and 2.64 seconds). There's also some overhead in the test code around doTicks, which is obviously not taken into account right now but also matters.

I can think of two workarounds. The first would be to increase the timeouts even more such that the difference between a timeout being triggered by o1 and the vote block time expiring on o2 becomes larger. That would make it less sensitive to those timing variations, but it would also mean that the test takes even longer. The second idea would be to use individual ticks to ensure that the elapsed time is right. The latter is probably a much better solution, and I'll try that.

The doTicks approach did not work well because it isn't very accurate and doesn't take into account the time spent outside of the actual ticking.
So this replaces it with a direct loop which compares against the current time to ensure that exactly the right amount of time has elapsed since the creation of the SyncObj.

This test may break again if the time spent inside SyncObj increases greatly; for example, if __init__ takes a very long time, the start time would no longer match the corresponding values inside the objects. Each individual tick must also not take too long.
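In outline, the replacement loop looks like this (simplified sketch; the helper in the test has a similar signature, shown in the review snippet further down, but may differ in detail):

import time

def tick(objects, startTime, totalTickTime, sleepTime):
    # Tick until exactly totalTickTime has elapsed since the SyncObj instances
    # were created, instead of trusting doTicks' own timeout.
    while time.time() - startTime < totalTickTime:
        for o in objects:
            o._onTick(timeToWait=0.0)
        time.sleep(sleepTime)
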
@coveralls

Coverage Status

Coverage increased (+0.5%) to 87.899% when pulling 1bc7246 on JustAnotherArchivist:persistence into b823639 on bakwc:master.

@JustAnotherArchivist
Contributor Author

So everything except 3.5 passed on both Travis and AppVeyor, and I still can't reproduce that failure. I'm also not sure what could cause a timing issue with the current code.
I'm giving up for now. If you have any ideas, please let me know.

@bakwc
Owner

bakwc commented Nov 17, 2018

Hard week. I'll try to look at this tomorrow. Thanks for the PR!

# We can't be sure whether this node voted in the current term yet, so don't vote until maxTimeout has elapsed.
# Factor 1.1 just to be safe.
# Note that an entire non-flushing cluster should always use the same maximum timeout to ensure that this method works!
self.__voteBlockTime = time.time() + 1.1 * self.__conf.raftMaxTimeout
Owner

I can't see any example of using a cluster where some of the nodes have flushing enabled and others don't. So there is no need to use this 1.1 constant. Also, I'm not sure this gives any guarantee; it looks like a hack.

Contributor Author

I don't see how having such a cluster relates to this change.

You're right that this factor 1.1 is a hack. The entire voteBlockTime thing is a giant hack really. The Raft algorithm requires that the term and vote (and the log) are stored to stable storage before responding to vote requests (or any other message), and using an in-memory journal or disabling flushing violates this requirement.

The Raft algorithm never relies on timing anywhere to guarantee consistency. Clock skew therefore won't affect reliability as long as the algorithm is implemented correctly. But in this particular case, clock skew will affect reliability (because the algorithm's assumptions are violated as mentioned above). For example, if the restarting node's clock is running faster than the rest of the cluster's clocks, it could still vote twice within a term. The factor 1.1 is a hacky fix for this issue in all but the most severe cases (if the clock is running more than 10 % too fast, something's seriously wrong).

In my opinion, using an in-memory journal or not flushing the journal file should not be an option in the first place since it violates a key assumption in Raft. I realise that this has a severe performance impact, but well, that's the cost for reliability. I prefer a slightly slower, reliable solution over a more performant one with nasty timing bugs that are nearly impossible to reproduce or debug.

Owner

OK, but maybe it would be a good idea to at least move it to the config (instead of hardcoding it in place)?

Contributor Author

Sounds reasonable. I'll do that.

@bakwc
Owner

bakwc commented Nov 18, 2018

Looked at it briefly, left a few comments.

pysyncobj/journal.py (resolved review thread)
# Specifically, o2 is expected to ignore the messages until 1.1 * timeout, i.e. including the one sent by o1 after 2.5 seconds, except for updating its term.
# Note that o1 has flushing enabled but o2 doesn't!

def tick(o1, o2, startTime, totalTickTime, sleepTime):
Owner

Time-specific tests will randomly fail sometimes; you should mock time or not rely on time at all.

Contributor Author

Oh, great idea. I'll implement that. 👍
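For example, something along these lines (a sketch using unittest.mock; the actual test setup may differ, and code that imported time.time directly would need a different patch target):

import time
try:
    from unittest import mock  # Python 3
except ImportError:
    import mock                # Python 2: the external 'mock' backport

class FakeClock(object):
    # Controllable replacement for time.time so the test advances the clock
    # deterministically instead of sleeping.
    def __init__(self, start=1000.0):
        self.now = start
    def __call__(self):
        return self.now

def test_with_mocked_time():
    clock = FakeClock()
    with mock.patch('time.time', clock):
        assert time.time() == 1000.0
        clock.now += 2.5  # "wait" 2.5 seconds instantly
        assert time.time() == 1002.5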

@JustAnotherArchivist
Contributor Author

I finally had some time to benchmark this, and it doesn't look good. With flushing disabled, I managed 14k RPS on my test machine (with a request size of 10 and 3 nodes). With flushing enabled, I managed 40. Yes, you read that right: forty instead of fourteen thousand, i.e. about 350 times slower!

This doesn't surprise me too much since flushing is a very expensive operation, although I'd have expected to reach at least a few hundred RPS.

I'm not sure how to improve this. Flushing only every so often doesn't make sense, as already mentioned in another comment, and reducing the flushing frequency is the only other approach I can think of. If you have any ideas, I'd be happy to hear about them.

As a sidenote, I also tested it with the journal file stored on a tmpfs. Obviously, that doesn't make any sense for actual usage, but it allows for testing the performance of the high-level flushing code. In this test, I'm able to reach the same ~14k RPS. This shows that it's the actual flushing to disk that murders the performance, not my code for calling the flushing. That shouldn't surprise anyone, but I still thought I'd mention it.

I tested this with the existing benchmarking code with a few changes to accommodate the journal file and flushing configuration settings. The patch file is here.

@bakwc
Owner

bakwc commented Dec 21, 2018

I think we can try to flush once per multiple writes. But we shouldn't increment lastCommitIdx until the flush.

@JustAnotherArchivist
Contributor Author

There isn't any benefit to that over just disabling flushing entirely though. The Raft spec requires that nodes do not respond to RPCs before persisting those values to disk. So if we only flush after a while, we're still in the same situation as master is now: there is no guarantee that a node hasn't voted, so it has to wait, which introduces timing issues into the cluster.

@bakwc
Owner

bakwc commented Dec 22, 2018

There is a batch mode: we already wait a little before sending commands. If we add another 0.1 s delay, that will be much better than reducing overall performance.

@JustAnotherArchivist
Contributor Author

Ah, so the append_entries messages typically already have multiple entries? In that case, it should be possible to make SyncObj append multiple entries at once to the journal and only then respond to the leader. I'll try that out.
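A rough sketch of that idea (assumed names, not the actual SyncObj code):

def onAppendEntries(journal, entries, sendNextNodeIdx):
    # Append the whole batch from one append_entries message, flush once,
    # and only then acknowledge the entries to the leader.
    for entry in entries:
        journal.add(entry)  # no per-entry flush
    journal.flush()         # single flush for the whole batch
    sendNextNodeIdx()       # respond to the leader only after the flush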

@JustAnotherArchivist
Contributor Author

Ok, that helps, but unfortunately the batches seem to be rather small (less than 10 entries in my tests), so it only provides a fairly small improvement. My test machine now manages about 70 RPS instead of 40.
I tried adjusting the appendEntriesPeriod (and the raft{Min,Max}Timeout) to get larger batches, but that didn't improve performance.

@bakwc
Owner

bakwc commented Dec 22, 2018

I'm not sure it's possible to achieve this using the existing batch mode in append_entries. I think you should try the following (a rough sketch follows the list):

  1. Don't call sendNextNodeIdx after receiving append_entries. Just add the messages to the journal.
  2. Inside the journal, add a function that calls flush N times per second (e.g. 5 or 10, configurable).
  3. After the journal has been flushed, call sendNextNodeIdx with the last flushed entry.
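Something along these lines, perhaps (a sketch of the suggestion above; all names, and the tick-driven scheduling, are assumptions):

import time

class PeriodicFlusher(object):
    # Buffer journal entries, flush at most flushesPerSecond times per second,
    # and acknowledge (sendNextNodeIdx) only entries that have been flushed.
    def __init__(self, journal, sendNextNodeIdx, flushesPerSecond=10):
        self._journal = journal
        self._sendNextNodeIdx = sendNextNodeIdx
        self._interval = 1.0 / flushesPerSecond
        self._lastFlush = 0.0
        self._lastAddedIdx = None
    def add(self, entry, idx):
        self._journal.add(entry)  # append without flushing
        self._lastAddedIdx = idx
    def tick(self):
        now = time.time()
        if self._lastAddedIdx is not None and now - self._lastFlush >= self._interval:
            self._journal.flush()
            self._lastFlush = now
            self._sendNextNodeIdx(self._lastAddedIdx)  # ack only flushed entries
            self._lastAddedIdx = None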

@JustAnotherArchivist
Contributor Author

Yeah. It looks like the Journal.add call in SyncObj._checkCommandsToApply is the bottleneck. After disabling flushing there, I manage 12k RPS. (That's with the batch-addition to the journal in the append_entries message handler in SyncObj.__onMessageReceived.)
I'll need to take a close look at that code to see how I could delay the flushing there.

@fengxuduke
Contributor

Any further progress on this issue? I'm really keen to see this improvement in RPS.


Successfully merging this pull request may close these issues.

Incomplete persistence
4 participants