Releases: jepsen-io/jepsen
v0.3.7
A very small bugfix release. Jepsen.history incorrectly threw IndexOutOfBoundsException when asked for an out-of-bounds index with a default value. Clojure's semantics are to return the default value, which we now match. This fixes specific kinds of destructuring bind on empty histories.
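The Clojure semantics in question can be seen with a plain vector, used here as a stand-in for a history (this sketch does not load jepsen.history). Sequential destructuring is built on nth with a nil default, which is why an empty collection has to return the default rather than throw.

```clojure
;; A plain vector stands in for an empty history (an assumption for
;; illustration; this does not use jepsen.history).
(def empty-history [])

;; Without a default, an out-of-bounds nth throws.
(def threw?
  (try (nth empty-history 0)
       false
       (catch IndexOutOfBoundsException _ true)))

;; With a default, nth returns the default instead of throwing.
(def with-default (nth empty-history 0 :none))

;; Sequential destructuring calls nth with a default of nil, so binding
;; against an empty collection yields nils.
(def destructured
  (let [[first-op second-op] empty-history]
    [first-op second-op]))
```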
v0.3.6
This is a sizeable release. It includes a significant correctness bugfix for a rare condition that could make operations in the history print with the wrong data. It also adds a new namespace for composing databases, nemeses, and generators when working with systems where each node has a different role. Kafka-style tests gain new powers and are significantly faster. And we have the usual slew of small bugfixes, dependency bumps, and quality-of-life improvements. Happy testing!
Bugfixes
- generator/fill-in-map no longer generates Ops with duplicate fields in their record extmaps. This fixes a rare bug where operations which used extra fields could wind up with two different values for (e.g.) (:value op) vs (pprint op). It should also improve speed and size on disk.
- checker.perf/with-range: fix a bug causing plots with zero data points to convert the plot to a string. This was expensive if the plot is large, and caused very confusing error messages. We now provide a short string message instead.
- net/iptables now handles the new error message from tc qdisc when calling net/fast!
API Changes
- control/exec will now throw a :nonzero-exit error when an exit status code is nil. Yes, this is apparently a thing that's possible.
- generator.test/with-fixed-rand-nth has been replaced by with-fixed-rands, which controls rand, rand-int, and rand-nth.
- tests/kafka: failed and info operations are now assumed to roll back consumer positions, rather than advancing them.
- tests/kafka: emit subscribe/assign ops only 1:64 ops, rather than 1:8. Now tunable via (:sub-p test).
New Features
- A new namespace, jepsen.role, supports systems where different nodes run different software.
- db/map-test wraps a DB in another which alters the test map. Helpful for composing DBs together which expect different things from their test maps.
- generator/each-process: like each-thread, this facets an underlying generator into a distinct one for each process.
- tests/kafka checks for transactions which read their own writes prior to commit.
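The idea behind db/map-test, wrapping one DB in another that rewrites the test map, can be sketched as follows. The protocol, function names, and argument order here are hypothetical stand-ins; jepsen.db's real protocol and the real signature of db/map-test may differ.

```clojure
;; Hypothetical, simplified stand-in for a DB protocol. jepsen.db's real
;; protocol has more to it than this.
(defprotocol SimpleDB
  (setup! [db test node] "Installs the DB on a node, given a test map."))

(defn map-test-sketch
  "Wraps db in a new SimpleDB which applies (f test) to the test map before
  delegating to db."
  [f db]
  (reify SimpleDB
    (setup! [_ test node]
      (setup! db (f test) node))))

;; An inner DB which reports the nodes it sees in its test map.
(def inner
  (reify SimpleDB
    (setup! [_ test node]
      {:node node, :sees-nodes (:nodes test)})))

;; Restrict the test map to a (made-up) subset of nodes before the inner DB
;; sees it.
(def storage-only
  (map-test-sketch #(assoc % :nodes ["n1" "n2"]) inner))

(def result (setup! storage-only {:nodes ["n1" "n2" "n3"]} "n1"))
```

Here the inner DB only ever sees the two nodes the wrapper chose to show it, which is the kind of composition the release note describes.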
Minor Changes
- os/centos now uses dpkg 1.19.8
- control.net/ip* now prefers v4 addresses
- control/on-nodes no longer spawns a future when given a single node--slightly more efficient.
- SSHJ now falls back to other auth methods after an AgentProxyException
- generator/map and f-map now return nil when given a nil generator, which simplifies some before-run checks.
- generator.test/default-test now includes a pair of :nodes, for generators that use nodes
- checker/check-safe now writes exceptions as data to the :error field of the results, rather than an unreadable string stacktrace
- tests.kafka now detects all duplicates even when given inconsistent offsets. It's nice to have both, it turns out.
- tests/kafka includes an :unseen key in poll operations to help operators track how far behind we are
- tests/kafka: new tests for the checker & generator
- tests/kafka: duplicate errors now include specific offsets
- tests/kafka: inconsistent-offsets errors now emit sorted sets, for readability
- tests/kafka is roughly 8x faster now, thanks to a slew of performance improvements
- tests/kafka also ignores the new cycle-exists variants of G0, G1c, etc.
- Jepsen's internal tests log less noise now
- Clojure 1.12.0
- tools.logging 1.3.0
- tools.cli 1.1.230
- unilog 0.7.32
- elle 0.2.2
- http-kit 2.8.0
- ring 1.12.2
- sshj 0.39.0
- data.codec 0.2.0
- data.fressian 1.1.0
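For the checker/check-safe change listed above, the difference between a string stacktrace and an exception written as data can be sketched with clojure.core's Throwable->map. Whether check-safe uses that function internally, and the exact shape of its result map, are assumptions here.

```clojure
;; Throwable->map is part of clojure.core. Using it here is an assumption
;; made for illustration, not a description of check-safe's internals.
(defn check-safe-sketch
  "Calls (check-fn) and returns its result. If it throws, returns a result
  map with the exception converted to data under :error."
  [check-fn]
  (try (check-fn)
       (catch Exception e
         {:valid? :unknown
          :error  (Throwable->map e)})))

(def result
  (check-safe-sketch
    #(throw (ex-info "checker blew up" {:op-index 12}))))

;; The error is now a map you can navigate, not a string to squint at.
(def cause (get-in result [:error :cause]))
```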
Full Changelog: v0.3.5...v0.3.6
0.3.5
This is a relatively small release. It incorporates a new version of Elle which brings dramatic performance improvements, and has a few quality-of-life improvements.
Bugfixes
- os.debian/install and remove now acquire locks, which means you can do multiple debian-package-affecting operations concurrently against a single node.
- Thread/sleep callsites now have explicit integer coercion, fixing crashes in newer JVMs/Clojure.
API Changes
- control.util/grepkill! now matches the full pattern of the process, rather than just the first 15 characters. This is particularly helpful for killing, say, one out of several java processes.
- net/Net has been moved to net.proto/Net. All its functions are still available in jepsen.net too.
New Features
- checker.perf: nemesis specifications can now include :hidden? true, which prevents them from appearing on graphs.
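As a rough illustration, a hidden nemesis specification might look like the second map below. The names, :start/:stop sets, and color are made up, and visible-specs is a hypothetical helper that only demonstrates the filtering idea; it is not checker.perf's implementation.

```clojure
;; Hypothetical nemesis specs. The names, :start/:stop sets, and color are
;; illustrative assumptions.
(def nemesis-specs
  [{:name  "partition"
    :start #{:start-partition}
    :stop  #{:stop-partition}
    :color "#E9A4A0"}
   {:name    "bookkeeping"
    :start   #{:start-bookkeeping}
    :stop    #{:stop-bookkeeping}
    :hidden? true}])

(defn visible-specs
  "Sketch of the filtering idea: keep only specs not marked :hidden?."
  [specs]
  (remove :hidden? specs))

(def visible-names (mapv :name (visible-specs nemesis-specs)))
```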
Minor Changes
- fs-cache/deploy-remote! returns the remote path it uploaded, making it easier to thread into other expressions.
- control.util/install-archive! now has docs that explain the use of file:// URLs.
- net/drop! now uses existing SSH connections, making it faster
- net/drop! has a clearer docstring
- Elle 0.2.1
- SSHJ 0.38.0
- Ring 1.11.0
0.3.4
This is a small bugfix & performance release. Just a little faster, a little more correct, a little easier to use. :-)
Bugfixes
- control.util/await-tcp-port no longer logs a truncated error message
- tests.kafka no longer crashes when checking histories without any received messages
- jepsen.independent's generators properly unlift the :value fields of the operations they pass through to underlying generators
Removals
- os.debian/install-jdk8! is gone now. The repos it relies on haven't worked in years.
Minor changes
- independent/checker uses a concurrent fold for breaking apart histories in fewer passes. This roughly doubles throughput in tests with lots of independent keys.
- independent/checker now returns results in a sorted map, which is easier to read
- generator.interpreter-test now tests matching open/close! invocations
- lazyfs version 0.2.0
- store.format logs more informative errors when serialization fails. You'll get a path to the specific element that couldn't be serialized, as well as its class. This makes serializing tests with new datatypes much less frustrating.
v0.3.3
This release updates Jepsen to run with Debian Bookworm. It also includes performance improvements aimed at testing large histories. Jepsen can run and check histories of up to a billion operations now.
Significant API Changes
- During test setup, (:generator test) is now wrapped in a new Forgettable reference type. You can deref this if you want access to the generator for some reason, but be aware that retaining the head of the generator often causes linear memory consumption during the test.
- After the test starts generating operations, attempting to deref (:generator test) will throw.
- core/run-case! and generator.interpreter/run! now return tests, rather than just histories.
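The behavior described for the Forgettable reference (deref works until the value is forgotten, then throws) can be sketched in plain Clojure. The type, protocol, and error message below are hypothetical and are not Jepsen's implementation.

```clojure
;; A hypothetical protocol and type sketching a reference that can drop its
;; value so the garbage collector can reclaim it.
(defprotocol Forget
  (forget! [this] "Drops the wrapped value."))

(deftype ForgettableRef [^:volatile-mutable value]
  Forget
  (forget! [this]
    (set! value ::forgotten)
    this)

  clojure.lang.IDeref
  (deref [_]
    (let [v value]
      (if (identical? ::forgotten v)
        (throw (ex-info "Generator has been forgotten" {}))
        v))))

(def gen-ref (ForgettableRef. [:op-1 :op-2]))

;; Before forgetting, deref returns the wrapped value.
(def before @gen-ref)

(forget! gen-ref)

;; After forgetting, deref throws.
(def after
  (try @gen-ref
       (catch clojure.lang.ExceptionInfo e
         (ex-message e))))
```

Dropping the only long-lived reference to the head of a lazy generator is what lets memory stay flat as the test consumes it.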
Performance Improvements
- Jepsen no longer retains the head of the generator. This dramatically improves memory consumption on long-running tests: running tests of a billion operations in a 512 MB heap is entirely reasonable.
- Elle 0.1.7 comes with significant speed and memory improvements to the G1a, G1b, and internal checkers for list-append and rw-register.
- Jepsen.history is much faster to execute folds on large (e.g. 100+ million op) histories. We converted a quadratic-time loop to linear.
Bugfixes
- control.net/ip filters out loopback interfaces. Bookworm started returning 127.x.x.x interfaces from getent ahosts on some platforms.
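A minimal sketch of that kind of filtering, assuming text in the general shape of getent ahosts output. The sample addresses are made up, non-loopback-ips is a hypothetical helper, and only IPv4 loopback addresses are handled.

```clojure
(require '[clojure.string :as str])

;; Sample text in the general shape of `getent ahosts` output. The addresses
;; and hostname are made up for illustration.
(def sample-output
  (str "127.0.1.1       STREAM n1\n"
       "127.0.1.1       DGRAM  \n"
       "192.168.122.11  STREAM \n"
       "192.168.122.11  DGRAM  \n"))

(defn non-loopback-ips
  "Returns the distinct addresses in getent-style output, dropping anything
  that starts with 127."
  [output]
  (->> (str/split-lines output)
       (map #(first (str/split (str/trim %) #"\s+")))
       (remove str/blank?)
       (remove #(str/starts-with? % "127."))
       distinct
       vec))

(def ips (non-loopback-ips sample-output))
```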
Minor Changes
- os.debian now refers to the new netcat package name.
- nemesis.time allows ntpdate to fail during setup/teardown (which now happens on Bookworm).
- nemesis.time no longer tries to use ntpdate -p, which is deprecated in Bookworm
- tools.cli 1.0.219
- elle 0.1.7
- jepsen.history 0.1.1
- http-kit 2.7.0
Full Changelog: v0.3.2...v0.3.3
0.3.2
This is a relatively small release with a few minor bugfixes and tweaks.
Bugfixes
- control.util/wget! correctly throws exceptions when encountering unrecoverable failures.
- generator.context Contexts now correctly handle assoc.
New Features
- client/timeout wraps an existing client in a new one that times out all operations after some time.
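The general technique of timing out an operation can be sketched with a future and a bounded deref. This is not client/timeout's implementation: invoke-with-timeout and the two invoke functions are hypothetical, and completing with :type :info and :error :timeout is an assumption about how one would usually report an operation whose outcome is unknown.

```clojure
(defn invoke-with-timeout
  "Runs (invoke op) on another thread. Returns its result, or an :info
  completion if it takes longer than timeout-ms."
  [timeout-ms invoke op]
  (let [fut    (future (invoke op))
        result (deref fut timeout-ms ::timed-out)]
    (if (= ::timed-out result)
      (do (future-cancel fut)
          ;; The op may or may not have taken effect, so it is indeterminate.
          (assoc op :type :info, :error :timeout))
      result)))

;; Hypothetical functions standing in for a real client's invoke!.
(defn fast-invoke [op] (assoc op :type :ok))
(defn slow-invoke [op] (Thread/sleep 2000) (assoc op :type :ok))

(def op {:type :invoke, :f :read, :value nil})
(def fast-result (invoke-with-timeout 1000 fast-invoke op))
(def slow-result (invoke-with-timeout 50 slow-invoke op))
```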
Minor Changes
- net/net-dev uses ip, rather than /sys/class/net, for identifying network interfaces
- Ring 1.10.0
- SSHJ 0.35.0
0.3.1
0.3.0
This release replaces many of Jepsen's internals with faster or more scalable data structures. It introduces significant new datatypes and adds new support libraries. Core generators are much faster, thanks to new Context and Op types. Running and analyzing tests can be 1-2 orders of magnitude faster: Jepsen can now run list-append tests at ~45,000 ops/sec and check them at ~30,000 ops/sec. Histories are streamed and loaded incrementally, which improves crash recovery, allows for histories larger than RAM, and speeds up REPL work. Histories in the hundreds of millions or even billions of operations are now tractable. Most checkers are parallelized and take advantage of sophisticated multi-query optimization for reductions over histories. A new dependency-aware executor allows checkers to run in parallel without starvation. New nemesis.combined packages support file truncation and bitflips, as well as network latency and packet loss.
As usual, most things should be API compatible, and we try to issue Obvious Warnings when they're not--but this is a big enough change that we're bumping the minor version from 0.2.7 to 0.3.0. Users integrating tightly with histories and generators should test their code carefully.
New Features
- A new library, jepsen.history, provides support for writing efficient checkers. It includes a transactional dependency-aware concurrent executor, concurrent and linear folds with multi-query optimization, and lazy datatypes for working with large histories.
- Operations are now represented by an Op defrecord (jepsen.history.Op) instead of maps. This yields significant performance and speed improvements. Ops have mandatory :index and :time fields, both longs. See jepsen.history for more details.
- Histories are incrementally streamed to the test.jepsen file, and sealed in 16384-operation chunks. If a test crashes during the run or analysis phase, you can likely recover some of its history and re-analyze it.
- Histories are now represented by subtypes of jepsen.history.History. These should be compatible with vectors, but stream their contents lazily from disk. Mapping between invocations and completions is now built in to histories, rather than being an external pair-index structure. Histories support efficient linear and concurrent folds with stream fusion and multi-query optimization, and directly support Tesser folds. Analyses may be 1-2 orders of magnitude faster, depending on hardware. See jepsen.history for details.
- dom-top.core has a new reducer macro which roughly doubles performance for reductions with multiple accumulator variables.
- Elle can catch new classes of anomalies, especially involving realtime and process-including anti-dependency cycles.
- lein run analyze now pulls the test arguments out of the test; you don't have to pass them every time.
- A new nemesis.combined/file-corruption-package provides support for bitflips and truncation of files.
- A new nemesis.combined/packet-package induces network latency and packet loss.
- A new tests.kafka namespace supports tests for Kafka-style append-only ordered logs.
- util/rand-distribution supports picking random numbers
Significant API Changes
- Operations are now jepsen.history.Ops, not maps. :index and :time fields are now mandatory.
- Histories are now subtypes of jepsen.history.History, not vectors. They should be mostly API compatible, and will transparently promote themselves to vectors on certain operations (for instance, conj).
- Generator contexts are now jepsen.context.Contexts, rather than maps. Accessing their old fields will throw and warn you to use new polymorphic functions in jepsen.context.
- lein run analyze now takes -t path-to-test or -t test-index, rather than the full arguments to recreate the test map.
- test.fressian files, deprecated in 0.2.x, are no longer generated. Use test.jepsen instead.
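For code that treated operations as maps, the practical effect of the move to records can be sketched with an ordinary defrecord. OpSketch is a simplified, hypothetical stand-in; see jepsen.history for the real Op type and its constructors.

```clojure
;; A simplified stand-in for jepsen.history.Op, for illustration only.
(defrecord OpSketch [^long index ^long time type process f value])

(def op (->OpSketch 0 1000 :invoke 3 :read nil))

;; Keyword lookup works as it did with maps.
(def op-type (:type op))

;; Extra keys are allowed, and the result is still a record.
(def op-with-node (assoc op :node "n1"))

;; Removing a declared field, however, demotes the record to a plain map.
(def without-index (dissoc op :index))
```

Most map-style code keeps working, but anything that dissocs a declared field, or checks for a concrete map class, is worth retesting.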
Performance Improvements
- Accessing operations is much faster thanks to jepsen.history.Op
- jepsen.generator is roughly an order of magnitude faster, especially for high (~thousands of threads) concurrency tests, thanks to the new generator.context.Context type.
- Generators can now dynamically compile context-filtering operations to BitSet intersections, which speeds up reserve, on-threads, clients, nemesis, and other generators.
- Reductions over histories (e.g. basically every checker) are 1-2 orders of magnitude faster, thanks to jepsen.history.
- Elle is roughly an order of magnitude faster, thanks to jepsen.history and careful parallelization.
- Assorted optimizations to generator/fill-in-op, soonest-op-mop, and reserve make them significantly faster.
- Tests no longer need to wait for history writing at the end of the test, since it's streamed to disk.
- Using functions as generators is now faster; we perform arity reflection only once rather than on every op.
- store.fressian decodes lists as vectors directly, rather than post-processing them. This makes Fressian decoding significantly faster.
Minor Improvements
- Jepsen and Elle used knossos.history and knossos.op extensively. These have been almost entirely replaced with jepsen.history.
- Most checkers have been rewritten to use jepsen.history; many reductions are now concurrent folds.
- Knossos 0.3.9
- Tools.cli 1.0.214
- Unilog 0.7.31
- Ring 1.9.6
- SSHJ 0.34.0
- Elle 0.1.6
- Lazyfs c16518f6
- Assorted type hints and compiler warnings resolved
- Contexts are deterministic again, rather than stochastic. This may break tests that depended on specific nondeterministic orders.
Full Changelog: v0.2.7...v0.3.0
0.2.7
This release introduces improved performance for control/exec by default. It adds new features for testing filesystem failures: nemesis/bitflip, which flips random bits in files, and lazyfs (experimental, known bugs) which loses un-fsynced writes to files. It fixes several minor bugs--for instance, failing to thread state correctly through the nemesis setup lifecycle--and catches up to new APIs and file locations in recent versions of Debian.
As an aside: getting Jepsen to run in Docker has been an ongoing tirefire for years, and the docs now recommend using plain old LXC or AWS instead.
API Changes
- The default remote for SSH is now control.sshj, not control.clj-ssh. This has been available for a few releases now, and is significantly faster than clj-ssh. This should be a basically transparent change, but some error messages thrown during e.g. unstable connections might change, and you might encounter different behavior around how it handles host and identity keys, agents, etc.
- The db/Process protocol is now called db/Kill; the metaprogramming hacks we had to do to call it Process were fragile under some AOT scenarios. Process remains as an alias.
- util/await-fn now catches all Exceptions, rather than just RuntimeExceptions. It turns out some things you'd really like to retry, like SQL connection exceptions from JDBC, aren't RuntimeExceptions.
- control.util/grepkill now uses pgrep to kill processes, rather than grep. Some tests were killing unexpected processes.
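To see why the util/await-fn change matters: java.sql.SQLException is a checked exception, so a retry loop that only catches RuntimeException would let it escape on the first attempt. The retry helper below is a hypothetical sketch of the catch-everything behavior, not await-fn itself.

```clojure
(defn retry
  "Hypothetical helper: calls (f) up to `tries` times, retrying whenever it
  throws any Exception, and rethrowing the last one."
  [tries f]
  (loop [remaining tries]
    (let [result (try {:value (f)}
                      (catch Exception e
                        (if (< 1 remaining)
                          ::retry
                          (throw e))))]
      (if (= ::retry result)
        (recur (dec remaining))
        (:value result)))))

(def attempts (atom 0))

(defn flaky-connect
  "Fails twice with a checked SQLException, then succeeds."
  []
  (if (< (swap! attempts inc) 3)
    (throw (java.sql.SQLException. "connection refused"))
    :connected))

(def outcome (retry 5 flaky-connect))

;; SQLException does not extend RuntimeException.
(def sql-is-runtime?
  (instance? RuntimeException (java.sql.SQLException. "x")))
```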
Bugfixes
- db/Process could, under some AOT scenarios, get compiled multiple times and fail to register as the same protocol. This meant that tests could quietly fail to actually kill a process because they thought the DB didn't support the Process protocol. This should hopefully be fixed now by Even More Metaprogramming Hacks, but we recommend moving to db/Kill just in case there are more bugs along these lines.
- lein run serve no longer trusts the local clock when listing local tests. This should fix issues with copying files from a machine in the future to one in the past, and those tests not showing up until the second node's clock catches up.
- The control.sshj remote now respects {:dummy? true}.
- nemesis.time's programs for bumping and strobing the clock no longer ran properly on newer platforms, thanks to a change which made it illegal to pass a time and a timezone to settimeofday. We didn't change the timezone, but it still failed to run.
- core/run! discarded the return value of nemesis/setup! and used the original nemesis throughout the test. Now it correctly uses the returned nemesis.
- nemesis/Validate returned invocation, not completion ops, and also did nothing after the initial setup! call. Both of these bugs are now fixed, which should provide better error guidance to users who make mistakes writing nemeses.
- tcpdump is located in /usr/bin on more recent versions of Debian; we now use the newer path.
- control/exec no longer incorrectly reports a command's STDIN as nil when throwing exceptions.
- docker/bin/up works on OS X again.
New Features
- jepsen.lazyfs, an experimental project for simulating the loss of un-fsynced writes, is now available. It does not work correctly--lazyfs has both crash and safety bugs in this version--but it still might help you find bugs.
- nemesis/bitflip is a new nemesis which can flip a random fraction of bits in a file. Helpful for fuzzing DBs' ability to handle filesystem corruption.
- store/fressian now serializes exceptions as data. A recurring problem in Jepsen tests is having a Throwable get into the history somewhere, and then exploding the serializer when it comes time to write the test. This is especially frustrating when nothing in the test itself logs that exception--you have no idea where it's coming from. Jepsen now serializes exceptions to data; this will not round-trip properly, but it does help you figure out the exception and operation that went wrong. These exceptions are also logged at level WARN during serialization. At the repl you can load the test and use a new utility function, jepsen.util/deepfind, to find the offending object.
- util/rand-exp generates random, exponentially-distributed values around a given mean.
- tests.cycle.wr now has a test constructor and docstring aligned with tests.cycle.list-append, as well as updated docs.
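One common way to draw exponentially distributed values with a given mean is inverse transform sampling, sketched below. rand-exp-sketch is a hypothetical function, and this is not necessarily how util/rand-exp is implemented.

```clojure
(defn rand-exp-sketch
  "Samples an exponentially distributed double with the given mean, using
  inverse transform sampling on a uniform (rand)."
  [mean]
  ;; (rand) is in [0, 1), so (- 1.0 (rand)) is in (0, 1] and the log is
  ;; never taken of zero.
  (* (- mean) (Math/log (- 1.0 (rand)))))

;; With many samples, the average should land close to the requested mean.
(def samples (repeatedly 100000 #(rand-exp-sketch 10)))
(def sample-mean (/ (reduce + samples) (count samples)))
```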
Small Changes
- We used to round off milliseconds in tests' :start-time, but this causes collisions when you run multiple tests in the same second. We now use millisecond resolution again.
- reconnect now passes through InterruptedIOException in the same way as InterruptedException, which should speed up/clarify the abort procedure when something goes wrong in e.g. DB setup using the control.sshj remote.
- util/stop-daemon! now throws a timeout when the kill operation hangs.
- nemesis.time now throws more informative errors when compilation fails
- New tests for nemeses
- Tests are a little quieter about logging now
- Clojure 1.11.1
- Unilog 0.7.30
- SSHJ 0.33.0
- Fipp 0.6.26
- Elle 0.1.5
- HTTP-kit 2.6.0
0.2.6
The biggest change in this release is the introduction of a custom file format for Jepsen tests: each store directory now has a test.jepsen file. This file can be incrementally written, which speeds up testing and allows for much better crash recovery. It can also be lazy-loaded: loading a test from disk takes only a few milliseconds rather than multiple minutes (for reasonably chonky tests), and gives you immediate access to the top-level test map and the results :valid? field. Loading the history or full results map is lazy and cached. This makes working with tests at the REPL much nicer, and you'll find new functions for loading tests at the repl too: store/all-tests and store/test. The new file format also speeds up the web interface and makes it more accurate for tests where results.edn is large or truncated.
All store functions are backwards compatible, using test.jepsen where possible, and falling back to results.edn and test.fressian otherwise. We still produce test.fressian files, but this will (unless people complain) be discontinued in the next Jepsen release to save storage space and speed up testing.
There are some improvements to the membership nemesis framework that make it easier to debug and to track complex state. Clients can explicitly signal that they'd like to tear down the process, even if they don't crash. You'll also find some quality-of-life improvements, including the ability to map log-files to short local paths.
API Changes
- No significant breaking API changes
Bugfixes
- Docker scripts no longer print out control characters on OS X / latest Debian
- Docker scripts don't throw "unbound variable" when POSITIONAL is empty
- db/tcpdump now allows you to specify multiple ports correctly--they were combined using and, not or.
- The web interface no longer flags some valid? tests as incomplete
New Features
- Tests are now stored in a new binary format (jepsen.store.format) which allows lazy loading of histories and results, as well as incremental saves.
- client/invoke can add :end-process? true to a completion operation, which forces the interpreter to terminate the process even if the op was ok/fail. Helpful for clients which need to be torn down even on definite failure.
- Web interface now uses test.jepsen, and uses a local cache to dramatically speed up load times
- Web interface has some basic pagination, for when you've got thousands of tests in a store dir
- db/log-files can now return a map of remote paths to short local paths, which lets you avoid deep nested paths
- nemesis.membership/invoke! can now return [op, member-state] pairs. This allows membership to track states resulting from applying operations in a purely functional way. Membership state changes now involve an exclusive lock, but node view fetching remains nonblocking.
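As a sketch of the :end-process? feature above, a client's completion logic might flag a definite failure that still leaves the connection unusable. The complete-read function, the :status values, and the :error values are hypothetical; only the :end-process? true key comes from the release notes.

```clojure
(defn complete-read
  "Hypothetical completion logic for a client: given the invoke op and a
  response map from the DB, returns the completion op."
  [op response]
  (case (:status response)
    :ok           (assoc op :type :ok, :value (:value response))
    ;; A definite failure, but the session is unusable afterwards, so ask the
    ;; interpreter to retire this process.
    :session-lost (assoc op :type :fail, :error :session-lost,
                         :end-process? true)
    ;; Anything else is treated as indeterminate.
    (assoc op :type :info, :error [:unexpected (:status response)])))

(def op {:type :invoke, :f :read, :value nil, :process 0})
(def lost (complete-read op {:status :session-lost}))
(def fine (complete-read op {:status :ok, :value 5}))
```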
Minor Changes
- lein run analyze and jepsen.core/run! read and emit test.jepsen files in addition to test.fressian.
- Web interface now displays local times for test runs
- Web interface click-to-copy-test-dir now uses double instead of single quotes, so you can paste at the REPL
- Docker scripts are now compatible with cgroupv2
- Compiled for Java source level 11
- store/test loads a test by either a test directory string, or an integer index (-1 for most recent)
- store/all-tests gives a lazy sequence of every test in chronological order
- Some fressian-related vars in jepsen.store have been moved to jepsen.store.fressian, but copies remain in jepsen.store for backwards compatibility
- Fressian writers and readers are now available as first-class vars in jepsen.store.fressian for later composition, if you so desire
- nemesis.membership offers slightly clearer logging messages
- fs-cache/write-atomic creates tempfiles in the same directory as the destination, so targets need not be in /tmp.
- elle 0.1.4
- dom-top 1.0.7
- tools.logging 1.2.4
- unilog 0.7.29
- ring 1.9.5
- bouncycastle/bcprov-jdk15on 1.70
- fipp 0.6.25