Skip to content

Latest commit

 

History

History
281 lines (227 loc) · 22.5 KB

lib-collab-crdt-blog.md

File metadata and controls

281 lines (227 loc) · 22.5 KB
title tags created modified
lib-collab-crdt-blog
blog
collaboration
crdt
2022-10-13 07:59:52 UTC
2022-10-13 08:00:21 UTC

lib-collab-crdt-blog

guide

  • the beginning of a survey of CRDT techniques and how to apply them to collaborative apps.
    • This post gives a gentle definition of op-based and state-based CRDTs and outlines my goals for the survey.
    • Part 2 of my CRDT survey blog series is out! This post covers "semantic techniques": deciding what a collaborative should do for users, independent of implementation details like op-based vs state-based.
  • There is another class of CRDTs: Merkle CRDTs.
  • Merkle-DAG CRDTs
  • Merkle Search Tree CRDTs
  • As the main author of the Merkle Search Tree paper, I'm really glad to see the design become more known and start to be used.

    • At Martin Kleppman's recommendation, we adopted MSTs for AT Protocol's data repository structures. This isn't my specialty so I wasn't deeply involved in evaluating or implementing them, but my understanding is that the self-balancing was a deciding factor and that everyone has been very happy with the outcomes.
  • Last Write - What Can Go Wrong?

    • Error 1: Forgetting to Update the Loser's Timestamp
    • Error 2: Forgetting & Inconsistent Tie Breaking
    • Error 3: Clock Pushing when Proxying Changes, A向中间代理B请求12点后的changes,B向C请求11点后的
    • Error 4: Trusting System Time
  • The simplest logical clock implementation is simply to keep an integer in your process that you increment for every event.

    • Each event gets timestamped with this counter.
    • To totally order events within your process, order by that timestamp.
    • To scale this up to many cores, use an atomic integer that you can compare-and-swap.
  • A register is a CRDT that holds a single value.

    • There are a couple kinds of registers, but the simplest is the Last Write Wins Register (or LWW Register).
  • LWW Registers simply overwrite their current value with the last value written.

    • They determine which write occurred last using timestamps, represented here by integers that increment whenever the value is updated.
  • You might ask: why not use the actual time?

    • Unfortunately, accurately syncing clocks between two computers is an extremely hard problem.
    • Using incrementing integers like this is one simple version of a logical clock, which captures the order of events relative to each other rather than to the “wall clock”.
  • In practice, most apps will only need Last-Writer-Wins registers and not the more complicated sequence CRDT's that you find in Y.js and Automerge.

    • We've built a auto-syncing database that uses CRDTs under the hood but never exposes them through the API. So if you want all of the benefits of CRDTs e.g. offline-first user experience, checkout our project, Triplit!
  • why didn't/don't you build Triplit on cr-sqlite?

    • We're aware of cr-sqlite and I've talked to the author, Matt, a few times.
    • Short answer: SQLite has serious shortcomings when it comes to reactivity and we think we can be as fast as SQLite for the application-type queries we aim to support.
    • The long answer would be about supporting all of features we don't need in SQLite and all of the quirks that come with it like having null as a primary key
    • We also recently came up with a relational-style querying system without joins
  • the fact that we always use text editing as the de-facto solution is so weird to me since that problem is both niche and extremely complex

    • The reason we use that is because it is complex enough to show the problems that CRDTs solve
    • Simple text inserts with a simple "insert after" CRDTs is not much more complicated but involves things like generated unique IDs without communication and how to resolve conflicts with some sort of globally consistent ordering.
  • I agree that it's simple, but that's exactly what makes it so powerful — it's easy to understand and yet you can do a ton with it. I've been working on a vector editor as well that's also built with just registers and maps. That one is a bit long for a blog post (although I might do a high level overview of some techniques like fractional indexing). But the point I'm trying to drive at here is that you can get really far just by combining simple CRDTs.

  • Figma's interesting because it's not strictly speaking a CRDT. It borrows heavily from some of the CRDT ideas, but it's really an editable tree where most changes are atomic and thus use a last-writer-wins approach. That, and re-ordering tree nodes uses fractional indexing.

    • Figma's interesting because it's not strictly speaking a CRDT. It borrows heavily from some of the CRDT ideas, but it's really an editable tree where most changes are atomic and thus use a last-writer-wins approach. That, and re-ordering tree nodes uses fractional indexing.
  • In An Interactive Intro to CRDTs, we learned what CRDTs are, and implemented two: a Last Write Wins Register and a Last Write Wins Map.
    • We now have everything we need to build a collaborative pixel art editor
  • If you take a closer look at the implementation details of block-wise RGA algorithms, you may notice that CRDT specifics usually depend on the metadata alone - we are usually only interested in the length of the content rather than content itself.

    • In reality we could abstract it into a separate structure optimized for text manipulations (like Rope or string buffers of specific editors), while RGA-specific metadata only holds ranges that given block is responsible for.
  • You can think of registers as value cells, that are able to provide CRDT semantic over any defined type.

    • Remember, that we're still constrained by commutativity/associativity/idempotency rules.
    • For this reason we must apply additional metadata, which will allow us to provide arbitrary conflict resolution in case of conflict detection.
  • Registers are one the most wide spread ways of working with CRDTs.

    • The reason is simple: they allow us to wrap any sort of ordinary data types in order to give them CRDT-compliant semantics.
    • However, this comes at the price - we cannot simply change any non-CRDT value into CRDT without some compromises: if that would be possible, we wouldn't need CRDTs in the first place.
    • For these, two popular approaches are known as last-write-wins and multi-value registers.
  • 👉🏻 Last write wins is a popular way of dealing with conflicts being the result of concurrent updates in many systems (Cassandra is good example of a database that's quite known from this approach).

  • The algorithm is simple: on each register value update, we timestamp it.

    • If our current timestamp is higher than the most recent one remembered by the register itself, we change the value, otherwise leave existing one (more recent) untouched.
  • One of the issues of last-write-wins approach is its inherent risk of data loss - we took an easy to use API at price of automatically throwing out potentially useful data.

  • 👉🏻 Another approach is to build a register that doesn't drop any data, instead it keeps track of all causal (happened-before) relationships between updates and in case of conflicts returns all conflicting cases.

    • It's known as Multi Value Register.
  • We track causality of register updates, and in case of conflicts - as when two independent updates are happening concurrently - we'll provide all conflicting values.

    • Programmer is then free to choose any value and override concurrent conflicts with more recent update.
  • A register may contain arbitrary data with some additional metadata on top.

    • Precisely what kind of metadata varies across different register implementations.
  • Assignments may be done concurrently by different threads and merging may:

    • discard all concurrent assignments but one,
    • keep all concurrent assignments.
  • A Last-Writer-Wins Register (LWW-Register) creates a total order of assignments by associating a timestamp with each update.

    • I understand that the explanation of LWW-Registers may be a little anticlimactic. They literally only store a single piece of metadata that is used to impose a total ordering.
  • But again, their power lies within their composition

    • values are now LWW-Registers that I express as a pair of (value, time).
    • map + lww
  • The Multi-Value Register (MV-Register) does not assume the presence of an ordered clock across all nodes.

    • Instead, it relies on a vector clock
  • The basic principle of a vector clock is the same as for a Lamport clock.

    • The clock does not measure real time; rather, it increments whenever an event occurs.
    • Each node keeps its own clock.
    • The difference now is that each node keeps each other node’s clock, too (i.e., a “vector” of clocks).
    • When a message is sent, the sending node increases only its own clock, but includes a copy of the entire vector in the message.
    • On the other side when a message is received, the receiving node again increments its own clock, and for everyone else’s clock, it takes the maximum of the own vector and the received vector.
  • Using a vector clock, this situation can be detected easily.

    • Whenever someone writes a value into the register, they also record the current state of their vector clock.
  • Now let’s say Alice and Bob want to merge their MV-Registers.

    • They first have to compare the timestamp, just like for a LWW-Register.
    • But whereas in a LWW-Register, they just compare two numbers, here they have to compare an entire vector.
    • If the two vector clocks are incomparable, then we have a concurrent write. It occurs when there is a connection loss right after the start. It means that both writes happened concurrently and we don’t really know which one is “better” than the other.
    • So what do we do? The name “Multi-Value Register” already gives it away. We keep both writes.
    • Neither value’s vector clock is ≥ any other value’s vector clock. Consequently, the MV-Register now simultaneously contains {0, 2, 3, 4}.
  • You might be wondering how such a mess could ever be cleaned up? Well, future writes might do that.

more-register

  • CRDT: Conflict-free Replicated Data Types

  • register: A memory cell with two operations — assign() and value().

    • The issue is with the assign() operations — they do not commute.
    • There are two approaches to solve this issue: lww / mv
  • lww

    • Columns in cassandra
    • NFS — a whole file or a part of it
  • mv

    • Amazon shopping basket. It has a well-known bug when an item re-appears in the basket after deletion. The reason is that MV-Register doesn’t behave like a set even though it stores a set of values (see below). Amazon indeed doesn’t treat that as a bug — it actually increases sales.
  • 通讯协议:WebSocket vs HTTP

  • 数据格式:JSON vs Binary

  • 协作算法:中心化还是去中心化? OT 算法 vs CRDT

  • 服务端:Actix + Diamond Types + CRDT

  • 客户端:编辑生成 patches

  • 客户端:编辑器应用 patches

  • 先简单统一一下概念

    • object: 可以理解为“副本”
    • operation: 操作接口,由客户端调用,分为两种,读操作query和写操作update
    • query: 查询操作,仅查询本地副本
    • update: 👉🏻 更新操作,先尝试进行本地副本更新,若更新成功则将本地更新同步至远端副本
    • merge: update在远端副本的合并操作
  • 一个数据结构符合CRDT的条件是update操作和merge操作需满足交换律、结合律和幂等律

  • 如果update操作本身满足以上三律,merge操作仅需要对update操作进行回放即可,这种形式称为op-based CRDT,最简单的例子是集合求并集。

  • 如果update操作无法满足条件,则可以考虑同步副本数据,同时附带额外元信息,通过元信息让update和merge操作具备以上三律,这种形式称为state-based CRDT。

  • 让元信息满足条件的方式是让其更新保持__单调__,这个关系一般被称为__偏序关系__。

    • 举一个简单例子,每次update操作都带上时间戳,在merge时对本地副本时间戳及同步副本时间戳进行比对,取更新的结果,这样总能保证结果最新并且最终一致,这种方式称为Last Write Wins
  • update操作无法满足三律,如果能将元信息附加在操作或者增量上,会是一个相对state-based方案更优化的选择

  • 如果同步过程能确保exactly once的语义,幂等律条件是可以被放宽掉,比如说加法本身满足交换律结合律但不幂等,如果能保证加法操作只回放一次,其结果还是最终一致的。

  • 以 Op-based CRDT 的思路设计 Last-write-wins Set(LWWSet)

crdt: state-based/CvRDTs vs operation-based/

  • Two kinds of CRDTs · CRDTs: Part 8

  • you might think that operation-based CRDTs are superior to their value-based friends.

    • The messages are smaller and they allow you to do more things since they are not constrained by monotonicity.
    • But there’s a catch (of course there is): The convergence theorem for them requires reliable broadcast channels.
    • That is: if a message with an operation gets lost, suddenly your replicas won’t agree on the correct value anymore.
    • It’s not impossible to set up a reliable channel
  • A service framework for operation-based CRDTs - Martin Krasser's Blog

  • State-based CRDTs are designed to disseminate state among replicas whereas operation-based CRDTs are designed to disseminate operations.

  • CmRDT/op replicas are guaranteed to converge if operations are disseminated through a reliable causal broadcast (RCB) middleware and if they are designed to be commutative for concurrent operations.

  • CvRDTs/state don’t require special guarantees from the underlying messaging middleware but require increasing network bandwidth for state dissemination with increasing state size.

    • They converge if their state merge function is defined as a join: a least upper bound on a join-semilattice. CvRDTs are not further discussed in this article.
  • The execution of a CmRDT operation is done in two phases, prepare and effect (also called atSource and downstream ):

    • prepare is executed on the local replica. It looks at the operation and (optionally) the current state and produces a message, representing the operation, that is then disseminated to all replicas.
    • effect applies the disseminated operation at all replicas.
  • State-based/CvRDT

    • replicas propagated changes by sending a full state of the objects.
    • A merge() function must be defined to merge incoming changes with the current state.
    • Used in such file systems as NFS, AFS, Coda and in such KV-storages like Riak, Dynamo.
  • Operation-based/CmRDT

    • replica propagates changes by sending the operation to all replicas.
    • Used in such cooperative systems, like Bayou, Rover, IceCube, Telex.
  • Delta-based

    • combines both State/op approaches and propagates so-called delta-mutators which update the state accordingly to the last synchronisation date.
    • You need to send a full state in case of the first ever synchronisation, however, some realisations actually consider the state of the remote replicas to lower the amount of needed data.
  • Pure operation-based

    • There is a delay in the op-based synchronization to create an effector().
    • In some systems such delays are unacceptable.
    • In this case an update must be propagated immediately at the cost of more complex organization protocol and more space requirements for metadata.
  • op-based synchronization requires effector() to be delivered only once to each replica.

    • 👉🏻 This requirement can be loosened by requiring effector() to be idempotent.
    • In practice, it is much easier to implement the former than the latter.
  • Op-based and State-based synchronizations can be emulated(仿真,模拟) by each other with keeping CRDT requirements.

  • 对各种数据类型进行了简介

  • Data consistency is hard.

    • Approaches to this usually take the form of ACID guarantees, BASE (basically available, eventually consistent) and in the case of CRDT strong eventual consistency.
  • How do we get events to be in the right order?

    • We can use the server's timestamp in a typical architecture to serve as the authoritative clock but this isn't available in distributed systems.
    • Logical Timestamps. Doing this kind of ordering helps us get closer to total ordering.
    • Event Logs. this is typically called event sourcing, where every operation is stored as data and playback is used when you need to construct the model. To give a log total ordering we need to sort operations by time and then by unique origin.
  • Consider how Figma makes these tradeoffs by combining multiple CRDT-inspired techniques (associativity, commutativity, idempotent) with a last-write-wins (LWW) approach and a central server to coordinate some aspects of conflicts and updates.

    • Other things like tree re-ordering in the document, undo/redo behaviors, are resolved with domain specific desired implementations.
    • Additionally, Figma side-steps the entire realtime text interleaving/merging issue due to the fact that the text content of a text element is just another property on an object.
    • As such, changes in this system still follow the LWW model they have gone with.
  • Additionally, they go over the Fractional Indexing approach that looks very much like the approach seen in Logoot/LSEQ.

  • Given that you have no record of tombstones (like in the case of Logoot/LSEQ) then you run into the problem of interleaving edits on merge.

    • However, Figma can make this trade-off due to the fact that it is a design tool not necessarily centered around realtime text editing.
  • OR-Set (Observed-Remove Set)

    • Similar to LWW-Element-Set except uses tags instead of timestamps.
    • For each element in the set, a list of add-tags and a list of remove-tags are maintained.
    • Insertions are done by generating a unique tag and added to the add-tag list for the element. Elements are removed by having all the tags in the element's add-tag list added to the element's remove tag set (tombstone list).
    • Merging is done by the union of the add-tag list for each element, and the union of remove-tag lists for each element.
    • An element is a member if the element's add tag list - remove tag list is nonempty.
  • Logoot/LSEQ/TreeDoc does not store tombstone deletions and instead takes an approach with fractional/unbounded divisible indexing.

    • This, as discussed earlier, can cause issues with interleaving edits so it becomes impractical to guarantee proper sane convergence with concurrent operations.
  • RGA

    • Widely considered to be one of the fastest CRDT implementations.
    • Utilizes a timestamped insertion tree and references hash table lookups for character lookups.
    • Causal trees are a similar approach (named CT but in RGA it's called a timestamped insertion tree).
    • In any case, CT/RGA algorithms have been proven to use the same algorithm

more-crdt