content addressed replication #49

carlsverre · 2024-01-25T02:10:06Z

Supercedes #45

In this replication variant, pages are replicated based on their content hash. Thus the current state of the file can be computed from the content store (provides access to pages by their content hash) and the current snapshot (list of content hashes).

Hash algorithms

see experiments repo
blake3 is the safest choice which outputs cryptographic 32 byte hashes
polymur is a faster choice which outputs 8 byte hashes at a collision rate of n * 2^-60.2 where n is the input size in bytes

snapshot size

assuming an object size of 4k and a hash output size of 32 bytes, a 1GB database snapshot will consume 8MB

notes on efficient replication

clients always want to update to the latest snapshot
ideally we are able to only store changes to the snapshot rather than re-storing the full snapshot
a tree based structure can allow for efficient structural sharing (conceptually similar to a merkel tree)
blake3 is a tree based hash function that internally computes a tree of hashes over 1kb chunks, and bao is an application of blake3 for verifiable streaming/seeking without needing full access to the data
one of the authors of Blake3 is currently factoring out the needed pieces of the blake3 crate for me to sit my replication algorithm directly on top of it (I think)

The text was updated successfully, but these errors were encountered:

carlsverre · 2024-01-25T02:29:50Z

Since SQLite changes it's header on every write to the database file, we should consider extracting the header (or simply the first page) out into the snapshot (or just sending the first page along with snapshot updates).

carlsverre self-assigned this Jan 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

content addressed replication #49

content addressed replication #49

carlsverre commented Jan 25, 2024

carlsverre commented Jan 25, 2024

content addressed replication #49

content addressed replication #49

Comments

carlsverre commented Jan 25, 2024

Hash algorithms

snapshot size

notes on efficient replication

carlsverre commented Jan 25, 2024