Skip to content

Latest commit

 

History

History
179 lines (144 loc) · 8.99 KB

INTERNALS.md

File metadata and controls

179 lines (144 loc) · 8.99 KB

Yjs Internals

This document roughly explains how Yjs works internally. There is a complete walkthrough of the Yjs codebase available as a recording: https://youtu.be/0l5XgnQ6rB4

The Yjs CRDT algorithm is described in the YATA paper from 2016. For an algorithmic view of how it works, the paper is a reasonable place to start. There are a handful of small improvements implemented in Yjs which aren't described in the paper. The most notable is that items have an originRight as well as an origin property, which improves performance when many concurrent inserts happen after the same character.

At its heart, Yjs is a list CRDT. Everything is squeezed into a list in order to reuse the CRDT resolution algorithm:

  • Arrays are easy - they're lists of arbitrary items.
  • Text is a list of characters, optionally punctuated by formatting markers and embeds for rich text support. Several characters can be wrapped in a single linked list Item (this is also known as the compound representation of CRDTs). More information about this in this blog article.
  • Maps are lists of entries. The last inserted entry for each key is used, and all other duplicates for each key are flagged as deleted.

Each client is assigned a unique clientID property on first insert. This is a random 53-bit integer (53 bits because that fits in the javascript safe integer range [JavaScript uses IEEE 754 floats]).

List items

Each item in a Yjs list is made up of two objects:

  • An Item (src/structs/Item.js). This is used to relate the item to other adjacent items.
  • An object in the AbstractType hierarchy (subclasses of src/types/AbstractType.js - eg YText). This stores the actual content in the Yjs document.

The item and type object pair have a 1-1 mapping. The item's content field references the AbstractType object and the AbstractType object's _item field references the item.

Everything inserted in a Yjs document is given a unique ID, formed from a ID(clientID, clock) pair (also known as a Lamport Timestamp). The clock counts up from 0 with the first inserted character or item a client makes. This is similar to automerge's operation IDs, but note that the clock is only incremented by inserts. Deletes are handled in a very different way (see below).

If a run of characters is inserted into a document (eg "abc"), the clock will be incremented for each character (eg 3 times here). But Yjs will only add a single Item into the list. This has no effect on the core CRDT algorithm, but the optimization dramatically decreases the number of javascript objects created during normal text editing. This optimization only applies if the characters share the same clientID, they're inserted in order, and all characters have either been deleted or all characters are not deleted. The item will be split if the run is interrupted for any reason (eg a character in the middle of the run is deleted).

When an item is created, it stores a reference to the IDs of the preceeding and succeeding item. These are stored in the item's origin and originRight fields, respectively. These are used when peers concurrently insert at the same location in a document. Though quite rare in practice, Yjs needs to make sure the list items always resolve to the same order on all peers. The actual logic is relatively simple - its only a couple dozen lines of code and it lives in the Item#integrate() method. The YATA paper has much more detail on this algorithm.

Item Storage

The items themselves are stored in two data structures and a cache:

  • The items are stored in a tree of doubly-linked lists in document order. Each item has left and right properties linking to its siblings in the document. Items also have a parent property to reference their parent in the document tree (null at the root). (And you can access an item's children, if any, through item.content).
  • All items are referenced in insertion order inside the struct store (src/utils/StructStore.js). This references the list of items inserted by for each client, in chronological order. This is used to find an item in the tree with a given ID (using a binary search). It is also used to efficiently gather the operations a peer is missing during sync (more on this below).

When a local insert happens, Yjs needs to map the insert position in the document (eg position 1000) to an ID. With just the linked list, this would require a slow O(n) linear scan of the list. But when editing a document, most inserts are either at the same position as the last insert, or nearby. To improve performance, Yjs stores a cache of the 80 most recently looked up insert positions in the document. This is consulted and updated when a position is looked up to improve performance in the average case. The cache is updated using a heuristic that is still changing (currently, it is updated when a new position significantly diverges from existing markers in the cache). Internally this is referred to as the skip list / fast search marker.

Deletions

Deletions in Yjs are treated very differently from insertions. Insertions are implemented as a sequential operation based CRDT, but deletions are treated as a simpler state based CRDT.

When an item has been deleted by any peer, at any point in history, it is flagged as deleted on the item. (Internally Yjs uses the info bitfield.) Yjs does not record metadata about a deletion:

  • No data is kept on when an item was deleted, or which user deleted it.
  • The struct store does not contain deletion records
  • The clientID's clock is not incremented

If garbage collection is enabled in Yjs, when an object is deleted its content is discarded. If a deleted object contains children (eg a field is deleted in an object), the content is replaced with a GC object (src/structs/GC.js). This is a very lightweight structure - it only stores the length of the removed content.

Yjs has some special logic to share which content in a document has been deleted:

  • When a delete happens, as well as marking the item, the deleted IDs are listed locally within the transaction. (See below for more information about transactions.) When a transaction has been committed locally, the set of deleted items is appended to a transaction's update message.
  • A snapshot (a marked point in time in the Yjs history) is specified using both the set of (clientID, clock) pairs and the set of all deleted item IDs. The deleted set is O(n), but because deletions usually happen in runs, this data set is usually tiny in practice. (The real world editing trace from the B4 benchmark document contains 182k inserts and 77k deleted characters. The deleted set size in a snapshot is only 4.5Kb).

Transactions

All updates in Yjs happen within a transaction. (Defined in src/utils/Transaction.js.)

The transaction collects a set of updates to the Yjs document to be applied on remote peers atomically. Once a transaction has been committed locally, it generates a compressed update message which is broadcast to synchronized remote peers to notify them of the local change. The update message contains:

  • The set of newly inserted items
  • The set of items deleted within the transaction.

Network protocol

The network protocol is not really a part of Yjs. There are a few relevant concepts that can be used to create a custom network protocol:

  • update: The Yjs document can be encoded to an update object that can be parsed to reconstruct the document. Also every change on the document fires an incremental document update that allows clients to sync with each other. The update object is a Uint8Array that efficiently encodes Item objects and the delete set.
  • state vector: A state vector defines the known state of each user (a set of tuples (client, clock)). This object is also efficiently encoded as a Uint8Array.

The client can ask a remote client for missing document updates by sending their state vector (often referred to as sync step 1). The remote peer can compute the missing Item objects using the clocks of the respective clients and compute a minimal update message that reflects all missing updates (sync step 2).

An implementation of the syncing process is in y-protocols.

Snapshots

A snapshot can be used to restore an old document state. It is a state vector + delete set. A client can restore an old document state by iterating through the sequence CRDT and ignoring all Items that have an id.clock > stateVector[id.client].clock. Instead of using item.deleted the client will use the delete set to find out if an item was deleted or not.

It is not recommended to restore an old document state using snapshots, although that would certainly be possible. Instead, the old state should be computed by iterating through the newest state and using the additional information from the state vector.