Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

content addressed replication #49

Open
carlsverre opened this issue Jan 25, 2024 · 1 comment
Open

content addressed replication #49

carlsverre opened this issue Jan 25, 2024 · 1 comment
Assignees

Comments

@carlsverre
Copy link
Contributor

Supercedes #45

In this replication variant, pages are replicated based on their content hash. Thus the current state of the file can be computed from the content store (provides access to pages by their content hash) and the current snapshot (list of content hashes).

Hash algorithms

  • see experiments repo
  • blake3 is the safest choice which outputs cryptographic 32 byte hashes
  • polymur is a faster choice which outputs 8 byte hashes at a collision rate of n * 2^-60.2 where n is the input size in bytes

snapshot size

assuming an object size of 4k and a hash output size of 32 bytes, a 1GB database snapshot will consume 8MB

notes on efficient replication

  • clients always want to update to the latest snapshot
  • ideally we are able to only store changes to the snapshot rather than re-storing the full snapshot
  • a tree based structure can allow for efficient structural sharing (conceptually similar to a merkel tree)
  • blake3 is a tree based hash function that internally computes a tree of hashes over 1kb chunks, and bao is an application of blake3 for verifiable streaming/seeking without needing full access to the data
  • one of the authors of Blake3 is currently factoring out the needed pieces of the blake3 crate for me to sit my replication algorithm directly on top of it (I think)
@carlsverre
Copy link
Contributor Author

Since SQLite changes it's header on every write to the database file, we should consider extracting the header (or simply the first page) out into the snapshot (or just sending the first page along with snapshot updates).

@carlsverre carlsverre self-assigned this Jan 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant