-
-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve OOB change handling #243
Comments
What you call "computing a diff" would require interpreting OOB changes and map them to e.g. a notebook schema, which is not an easy task. In this issue I proposed a bridge between different document structures, that is basically solving the same issue. ycell = ynotebook["cells"][idx]
ytext = ycell["source"]
ytext.insert(pos, char) For more complex changes, it will even be ambiguous to infer changes. See for instance how |
Introduction
(this section is mostly copied from #240)
An out-of-band (abbrev. OOB) change is an edit of a collaborative file not done through JupyterLab's UI. For example, suppose user A runs
git reset --hard <other-branch>
while another user B is editing a collaborative document within the same repository. A'sgit reset
deletes all edits made by B and sets the contents of that document to some fixed value. Then the YStore table containing B's edits needs to be dropped, and this event must be somehow communicated to all Yjs clients connected to the document. Here, we say that an out-of-band change occurred on B's document.Please see #240 for information on how OOB changes are detected & signaled. This issue focuses exclusively on how OOB changes are handled by the backend.
Description
The current implementation of how OOB changes can be improved to potentially reduce data loss and to greatly improve performance on long-running collaboration sessions.
Context: how OOB changes are handled currently
The handling of OOB changes is implemented in
DocumentRoom
, mostly in the_on_outofband_change()
method. The relevant portions of its source are:From the above source, we see that when an OOB change occurs,
DocumentRoom
will fetch the latest file contents. If that does not raise an exception, then the most important block is reached, which is highlighted below. When an OOB change occurs, this block sets the source of the YDoc to the latest file contents:To understand this, we need to know what A) what
self._document
is, and B) what happens whenself._document.source
is set.self._document
is a Ydoc fromjupyter_ydoc
.jupyter_ydoc
provides YDoc-like classes that are wrappers aroundpycrdt.Doc
, each of which implement a unique interface for a specific file type.jupyter_ydoc
currently provides three types of Ydoc-like classes:YUnicode/YFile
: a YDoc that represents a UTF-8 plaintext fileYNotebook
: a YDoc that represents a notebook fileYBlob
: a YDoc that represents a blob fileWe will refer to these classes provided by
jupyter_ydoc
as Jupyter Ydocs, for specificity.We now know what
self._document
is. It is one of the 3 Jupyter Ydocs, all of which inherit fromjupyter_ydoc.YBaseDoc
. Now we need to know what happens whenself._document.source
is set.Here are the relevant portions of
YBaseDoc
's source that determine the behavior of setting itssource
attribute:As shown above,
get()
andset()
are implemented in the YDoc class directly inheriting fromYBaseDoc
. This means that ultimately, the handling of an OOB change depends on the file type, and is determined by the definition ofset()
in the corresponding Jupyter Ydoc.Issue part 1: Verify that OOB change handling doesn't lead to data loss
I have looked through the latest implementation of the
set()
method onYUnicode
,YNotebook
,YBlob
, and could not find any immediate concerns. Each implementation flags all of the existing items for deletion, and then inserts the new content from the beginning. This does not delete the Ydoc and therefore should not result in any data loss. I've also tested the OOB change handling in the latest version ofjupyter_collaboration
with this one-liner shell script:This script sleeps 5 seconds, then makes an OOB change that sets the content of the first cell to a fixed string. I ran this script and immediately began making edits to the first cell until the OOB change occurred. This did not lead to any data loss for me locally.
While I have not yet found any specific code paths that lead to data loss in the latest implementation, I think that it remains debatable whether the current OOB change handling leads to data loss in certain scenarios. This is supported by the fact that myself and others have noticed OOB changes being logged by the extension immediately prior to data loss incidents. See #219 as an example.
Issue part 2: Each OOB change grows the Ydoc by at least the new file size
The implementations of
set()
, while correct, have the unfortunate consequence of growing the Ydoc's size by a factor of the new file size. That is, an OOB change on a 0.5 MB notebook will grow the Ydoc by at least 0.5 MB.You can confirm this locally with these steps:
1kb.txt
locally.jupyter_collaboration
installed.1kb.txt
through JupyterLab..jupyter_ystore.db
.ls -lah . | grep 'ystore.db'
..jupyter_ystore.db
again.On my machine, the Ystore starts at 12K and grows to 20K after making 5 OOB changes. I expected a growth in size of at least 5K, and observed a growth of 8K. This test confirms that currently, each OOB change grows the Ydoc and Ystore by at least the new file size.
Larger Ydocs lead to poorer performance. The most significant consequence is that a larger Ydoc results in a slower connection for new users, as the entire Ydoc is streamed to new clients. I will elaborate on this in a future issue concerning the YStore specifically.
Proposed next steps
Increase confidence in the claim that OOB changes do not lead to data loss. To do so, we should:
Reduce the growth in YStore size on OOB changes
DocumentRoom
either to a) completely ignore OOB changes, or b) to delete the YDoc from memory, drop the YStore table, and begin anew.The text was updated successfully, but these errors were encountered: