Commit b6a04ef

initial import
1 parent 916e507 commit b6a04ef

12 files changed

+1192
-2
lines changed

Cargo.toml

Lines changed: 2 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -3,12 +3,12 @@
33
members = [
44
"crates/*",
55
"crates/bpe/benchmarks",
6-
"crates/bpe/tests",
6+
"crates/bpe/tests"
77
]
88
resolver = "2"
99

1010
[profile.bench]
1111
debug = true
1212

1313
[profile.release]
14-
debug = true
14+
debug = true

crates/hriblt/Cargo.toml

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
1+
[package]
2+
name = "hriblt"
3+
version = "0.1.0"
4+
edition = "2024"
5+
description = "Algorithm for rateless set reconciliation"
6+
repository = "https://github.com/github/rust-gems"
7+
license = "MIT"
8+
keywords = ["set-reconciliation", "sync", "algorithm", "probabilistic"]
9+
categories = ["algorithms", "data-structures", "mathematics", "science"]
10+
11+
[dependencies]
12+
thiserror = "2"

crates/hriblt/README.md

Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
1+
# Hierarchical Rateless Bloom Lookup Tables
2+
3+
A novel algorithm for computing the symmetric difference between two sets, where the amount of data exchanged is proportional to the size of the difference rather than to the overall size of the sets.
4+
5+
## Usage
6+
7+
Add the library to your `Cargo.toml` file.
8+
9+
```toml
10+
[dependencies]
11+
hriblt = "0.1"
12+
```
13+
14+
Create two encoding sessions, one containing your data and another containing the counterparty's data. The counterparty's data might, for example, have been sent to you over a network.
15+
16+
The following example attempts to reconcile the differences between two sets of `u64` integers, and is written from the perspective of "Bob", who has received some symbols from "Alice".
17+
18+
```rust
19+
use hriblt::{DecodingSession, EncodingSession, DefaultHashFunctions};
20+
// On Alice's computer
21+
22+
// Alice creates an encoding session...
23+
let mut alice_encoding_session = EncodingSession::<u64, DefaultHashFunctions>::new(DefaultHashFunctions, 0..128);
24+
25+
// And adds her data to that session, in this case the numbers from 0 to 10.
26+
for i in 0..=10 {
27+
alice_encoding_session.insert(i);
28+
}
29+
30+
// On Bob's computer
31+
32+
// Bob creates his encoding session. Note that the range **must** be the same as Alice's.
33+
let mut bob_encoding_session = EncodingSession::<u64, DefaultHashFunctions>::new(DefaultHashFunctions, 0..128);
34+
35+
// Bob adds his data, the numbers from 5 to 15.
36+
for i in 5..=15 {
37+
bob_encoding_session.insert(i);
38+
}
39+
40+
// "Subtract" Bob's coded symbols from Alice's, the remaining symbols will be the symmetric
41+
// difference between the two sets, iff we can decode them. This is a commutative function so you
42+
// could also subtract Alice's symbols from Bob's and it would still work.
43+
let merged_sessions = alice_encoding_session.merge(bob_encoding_session, true);
44+
45+
let decoding_session = DecodingSession::from_encoding(merged_sessions);
46+
47+
assert!(decoding_session.is_done());
48+
49+
let mut diff = decoding_session.into_decoded_iter().map(|v| v.into_value()).collect::<Vec<_>>();
50+
51+
diff.sort();
52+
53+
assert_eq!(diff, [0, 1, 2, 3, 4, 11, 12, 13, 14, 15]);
54+
55+
```
Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
1+
# Hash Functions
2+
3+
This library has a trait, `HashFunctions`, which is used to create the hashes required to place your symbol into the range of coded symbols.
4+
5+
The following documentation provides more details on this trait in particular. How and why this is done is explained in the `overview.md` documentation.
6+
7+
## Hash stability
8+
9+
When using HRIBLT in production systems it is important to consider the stability of your hash functions.
10+
11+
We provide a `DefaultHashFunctions` type, which is a wrapper around the `DefaultHasher` type provided by the Rust standard library. Although the seed for this function is fixed, the hashes produced by this type are *not* guaranteed to be stable across different versions of the Rust standard library. As such, you should not use this type in any situation where clients might be running binaries built with different or unspecified versions of Rust.
12+
13+
We recommend you implement your own `HashFunctions` implementation with a stable hash function.
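
As a minimal sketch of what "stable" means here, the following uses 64-bit FNV-1a, whose output depends only on the input bytes and not on the Rust version used to build the binary. The choice of FNV-1a and the names `fnv1a_64` and `stable_hash_u64` are illustrative assumptions, not part of this crate. Wiring such a function into a `HashFunctions` implementation is not shown, because the trait's full definition does not appear in this commit excerpt.

```rust
// Illustrative only: FNV-1a is used here as an example of a hash with a
// fixed, published definition. It is not the hash used by hriblt.
const FNV_OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// 64-bit FNV-1a over a byte slice.
pub fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut hash = FNV_OFFSET_BASIS;
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}

/// Hashes a `u64` via a fixed (little-endian) byte encoding, so the result
/// is also independent of the platform's native endianness.
pub fn stable_hash_u64(value: u64) -> u64 {
    fnv1a_64(&value.to_le_bytes())
}
```

FNV-1a is simple rather than strong. A production system would probably prefer a better-distributed hash that also has a fixed specification.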
14+
15+
## Hash value hashing trick
16+
17+
If the value you're inserting into the encoding session is a high entropy random value, such as a cryptographic hash digest, you can recycle the bytes in that value to produce the coded symbol indexing hashes, instead of hashing that value again. This results in a constant-factor speed up.
18+
19+
For example, if you were trying to find the difference between two sets of documents, each value could be a SHA1 hash of the document content instead of the whole document. Since each SHA1 digest is 20 bytes of high-entropy data, instead of hashing this value five more times to produce the five coded symbol indices, we can simply slice five `u32` values out of the digest itself.
20+
21+
This is a useful trick because hash values are often used as IDs for documents during set reconciliation since they are a fixed size, making serialization easy.
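
The slicing step might look like the sketch below. The function name `hashes_from_digest` and the little-endian byte order are assumptions made for illustration. How the resulting `u32` values are mapped to coded symbol indices depends on the `HashFunctions` trait, which is not reproduced here.

```rust
/// Splits a 20-byte digest (for example a SHA1 digest) into five `u32`
/// values, reading each 4-byte group as little-endian.
/// Hypothetical helper, not part of the hriblt API.
pub fn hashes_from_digest(digest: &[u8; 20]) -> [u32; 5] {
    let mut out = [0u32; 5];
    for (slot, chunk) in out.iter_mut().zip(digest.chunks_exact(4)) {
        *slot = u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]);
    }
    out
}
```

Both parties must use the same byte order and the same grouping, otherwise they would derive different indices for the same value.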

crates/hriblt/docs/sizing.md

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
1+
# Sizing your HRIBLT
2+
3+
Because the HRIBLT is rateless, it is possible to append additional data in order to make decoding possible. That is, it does not need to be sized in advance like a standard invertible Bloom lookup table.
4+
5+
Regardless, there are some advantages to getting the size of your decoding session correct the first time. An example might be if you're performing set reconciliation over some RPC and you want to minimise the number of round trips it takes to perform a decode.
6+
7+
## Coded Symbol Multiplier
8+
9+
The number of coded symbols required to find the difference between two sets is proportional to the size of that difference. The following chart shows the relationship between the number of coded symbols required to decode an HRIBLT and the size of the diff. Note that the size of the base set (before diffs were added) was fixed.
10+
11+
`y = len(coded_symbols) / diff_size`
12+
13+
![Coded symbol multiplier](./assets/coded-symbol-multiplier.png)
14+
15+
For small diffs, the number of coded symbols required per value is larger. Beyond a difference of approximately 100 values, the coefficient settles at around 1.3 to 1.4.
16+
17+
You can use this chart, combined with an estimate of the diff size (perhaps from a `geo_filter`) to increase the probability that you will have a successful decode after a single round-trip while also minimising the amount of data sent.
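
As a rough sketch of that calculation, the helper below turns a diff-size estimate into a number of coded symbols to send up front. The name `estimated_coded_symbols` and the percentage-based multiplier are assumptions for illustration. The multiplier itself has to be read off the chart, and small diffs need a larger value than the 1.3 to 1.4 range quoted above.

```rust
/// Estimates how many coded symbols to send for a given diff-size estimate.
/// `multiplier_percent` is the chart coefficient times 100 (e.g. 140 for 1.4).
/// Integer arithmetic is used so the result is rounded up exactly.
/// Hypothetical helper, not part of the hriblt API.
pub fn estimated_coded_symbols(diff_estimate: usize, multiplier_percent: usize) -> usize {
    (diff_estimate * multiplier_percent + 99) / 100
}
```

For example, an estimated diff of 1000 values with a coefficient of 1.4 suggests sending 1400 coded symbols in the first round trip.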

crates/hriblt/src/coded_symbol.rs

Lines changed: 127 additions & 0 deletions
@@ -0,0 +1,127 @@
1+
use crate::{Encodable, HashFunctions, index_for_seed, indices};
2+
3+
/// Represents a coded symbol in the invertible bloom filter table.
4+
/// In some of the literature this is referred to as a "cell" or "bucket".
5+
/// It includes a checksum to verify whether the instance represents a pure value.
6+
#[derive(Debug, Clone, Copy, Eq, PartialEq, Hash)]
7+
pub struct CodedSymbol<T: Encodable> {
8+
/// Values aggregated by XOR operation.
9+
pub value: T,
10+
/// We repurpose the two least significant bits of the checksum:
11+
/// - The least significant bit is a one bit counter which is incremented for each entity.
12+
/// This bit must be set when there is a single entity represented by this hash.
13+
/// - The second least significant bit indicates whether the entity is a deletion or insertion.
14+
pub checksum: u64,
15+
}
16+
17+
impl<T: Encodable> Default for CodedSymbol<T> {
18+
fn default() -> Self {
19+
CodedSymbol {
20+
value: T::zero(),
21+
checksum: 0,
22+
}
23+
}
24+
}
25+
26+
impl<T: Encodable> From<(T, u64)> for CodedSymbol<T> {
27+
fn from(tuple: (T, u64)) -> Self {
28+
Self {
29+
value: tuple.0,
30+
checksum: tuple.1,
31+
}
32+
}
33+
}
34+
35+
impl<T: Encodable> CodedSymbol<T> {
36+
/// Creates a new coded symbol with the given hash and deletion flag.
37+
pub(crate) fn new<S: HashFunctions<T>>(state: &S, hash: T, deletion: bool) -> Self {
38+
let mut checksum = state.check_sum(&hash);
39+
checksum |= 1; // Add a single bit counter
40+
if deletion {
41+
checksum = checksum.wrapping_neg();
42+
}
43+
CodedSymbol {
44+
value: hash,
45+
checksum,
46+
}
47+
}
48+
49+
/// Merges another coded symbol into this one.
50+
pub(crate) fn add(&mut self, other: &CodedSymbol<T>, negate: bool) {
51+
self.value.xor(other.value);
52+
if negate {
53+
self.checksum = self.checksum.wrapping_sub(other.checksum);
54+
} else {
55+
self.checksum = self.checksum.wrapping_add(other.checksum);
56+
}
57+
}
58+
59+
/// Checks whether this coded symbol is pure, i.e., whether it represents a single entity
60+
/// A pure coded symbol must satisfy the following conditions:
61+
/// - The 1-bit counter must be 1 or -1 (which are both represented by the bit being set)
62+
/// - The checksum must match the checksum of the value.
63+
/// - The indices of the value must match the index of this coded symbol.
64+
pub(crate) fn is_pure<S: HashFunctions<T>>(
65+
&self,
66+
state: &S,
67+
i: usize,
68+
len: usize,
69+
) -> (bool, usize) {
70+
if self.checksum & 1 == 0 {
71+
return (false, 0);
72+
}
73+
let multiplicity = indices_contains(state, &self.value, len, i);
74+
if multiplicity != 1 {
75+
return (false, 0);
76+
}
77+
let checksum = state.check_sum(&self.value) | 1;
78+
if checksum == self.checksum || checksum.wrapping_neg() == self.checksum {
79+
(true, 0)
80+
} else {
81+
let required_bits = self
82+
.checksum
83+
.wrapping_sub(checksum)
84+
.leading_zeros()
85+
.max(self.checksum.wrapping_add(checksum).leading_zeros())
86+
as usize;
87+
(false, required_bits)
88+
}
89+
}
90+
91+
/// Checks whether this coded symbol is zero, i.e., whether it represents no entity.
92+
pub(crate) fn is_zero(&self) -> bool {
93+
self.checksum == 0 && self.value == T::zero()
94+
}
95+
96+
/// Checks whether this coded symbol represents a deletion.
97+
pub(crate) fn is_deletion<S: HashFunctions<T>>(&self, state: &S) -> bool {
98+
let checksum = state.check_sum(&self.value) | 1;
99+
checksum != self.checksum
100+
}
101+
}
102+
103+
/// This function checks efficiently whether the given index is contained in the indices.
104+
///
105+
/// Note: we have constructed the indices such that we can determine from the last 5 bits
106+
/// which hash function would map to this index. Therefore, we only need to check against
107+
/// a single hash function and not all 5!
108+
/// The only exception is for very small indices (0..32) or if the index is a multiple of 32.
109+
///
110+
/// The function returns the multiplicity, i.e. how many indices hit this particular index.
111+
/// Thereby, it takes into account whether the value is stored negated or not.
112+
fn indices_contains<T: std::hash::Hash>(
113+
state: &impl HashFunctions<T>,
114+
value: &T,
115+
stream_len: usize,
116+
i: usize,
117+
) -> i32 {
118+
if stream_len > 32 && i % 32 != 0 {
119+
let seed = i % 4;
120+
let j = index_for_seed(state, value, stream_len, seed as u32);
121+
if i == j { 1 } else { 0 }
122+
} else {
123+
indices(state, value, stream_len)
124+
.map(|j| if j == i { 1 } else { 0 })
125+
.sum()
126+
}
127+
}
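
The checksum arithmetic in `CodedSymbol::new` and `CodedSymbol::add` can be illustrated in isolation. The sketch below is a standalone model, not the crate's code: `tagged_checksum` and `counter_bit_set` are hypothetical names, and the raw checksum values are arbitrary constants rather than outputs of `check_sum`.

```rust
/// Mirrors the tagging in `CodedSymbol::new`: force the low bit to 1 (the
/// one-bit counter), then negate the whole checksum for deletions.
pub fn tagged_checksum(raw: u64, deletion: bool) -> u64 {
    let checksum = raw | 1;
    if deletion {
        checksum.wrapping_neg()
    } else {
        checksum
    }
}

/// The low bit is set when an odd number of entities has been summed in,
/// which is a necessary (not sufficient) condition for a pure symbol.
pub fn counter_bit_set(checksum: u64) -> bool {
    checksum & 1 == 1
}
```

Negating an odd number yields an odd number, so the counter bit survives a deletion. Adding an insertion and a deletion of the same value wraps to zero, which is why matching entries cancel when two sessions are merged.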

crates/hriblt/src/decoded_value.rs

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
1+
/// A value that has been found by the set reconciliation algorithm.
2+
#[derive(Debug, Clone, Copy, Eq, PartialEq, Hash, PartialOrd, Ord)]
3+
pub enum DecodedValue<T> {
4+
/// A value that has been added
5+
Addition(T),
6+
/// A value that has been removed
7+
Deletion(T),
8+
}
9+
10+
impl<T> DecodedValue<T> {
11+
/// Consume this `DecodedValue` to return the value
12+
pub fn into_value(self) -> T {
13+
match self {
14+
DecodedValue::Addition(v) => v,
15+
DecodedValue::Deletion(v) => v,
16+
}
17+
}
18+
19+
/// Borrow the value within this decoded value.
20+
pub fn value(&self) -> &T {
21+
match self {
22+
DecodedValue::Addition(v) => v,
23+
DecodedValue::Deletion(v) => v,
24+
}
25+
}
26+
27+
/// Returns true if this decoded value is a deletion
28+
pub fn is_deletion(&self) -> bool {
29+
matches!(self, DecodedValue::Deletion(_))
30+
}
31+
}
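
A caller that needs to know which side each decoded value came from could split the results by variant instead of calling `into_value()` on everything. The sketch below redefines a local copy of the enum so that it compiles on its own. `split_by_kind` is a hypothetical helper, and which party's values appear as `Addition` versus `Deletion` depends on the merge order, which this excerpt does not specify.

```rust
// Local stand-in mirroring the enum in decoded_value.rs, so this sketch is
// self-contained. In real code you would import it from the crate instead.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DecodedValue<T> {
    Addition(T),
    Deletion(T),
}

/// Splits decoded values into (additions, deletions), preserving order.
pub fn split_by_kind<T>(
    decoded: impl IntoIterator<Item = DecodedValue<T>>,
) -> (Vec<T>, Vec<T>) {
    let mut additions = Vec::new();
    let mut deletions = Vec::new();
    for item in decoded {
        match item {
            DecodedValue::Addition(v) => additions.push(v),
            DecodedValue::Deletion(v) => deletions.push(v),
        }
    }
    (additions, deletions)
}
```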
