Commit f65e262

Add readme and script to generate plots
1 parent 39c65cc commit f65e262

4 files changed: +118 -1 lines changed

crates/hriblt/docs/sizing.md

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ The number of coded symbols required to find the difference between two sets is
 
 `y = len(coded_symbols) / diff_size`
 
-![Coded symbol multiplier](./assets/coded-symbol-multiplier.png)
+![Coded symbol multiplier](../evaluation/overhead/overhead.png)
 
 For small diffs, the number of coded symbols required per value is larger; beyond a difference of approximately 100 values, the coefficient settles at around 1.3 to 1.4.
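
As a rough illustration of the sizing formula in this hunk, the sketch below solves `y = len(coded_symbols) / diff_size` for the number of coded symbols to budget. The function name `estimate_coded_symbols` and the default multiplier of 1.4 are assumptions made for this example (the default is the upper end of the 1.3 to 1.4 range quoted above); neither is part of the hriblt API, and small diffs need a larger multiplier than the default.

```python
import math


def estimate_coded_symbols(diff_size: int, multiplier: float = 1.4) -> int:
    """Estimate len(coded_symbols) from y = len(coded_symbols) / diff_size.

    `multiplier` plays the role of y. The default of 1.4 is an assumption
    that only fits diffs of roughly 100 values or more.
    """
    if diff_size < 0:
        raise ValueError("diff_size must be non-negative")
    # Round up: a fractional coded symbol cannot be sent.
    return math.ceil(diff_size * multiplier)


if __name__ == "__main__":
    for size in (100, 500, 1000):
        print(size, estimate_coded_symbols(size))
```

For diffs well below 100 values, pass a larger `multiplier` read off the plot rather than relying on the default.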

crates/hriblt/evaluation/README.md

Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
+# Evaluation
+
+The crate includes an overhead evaluation to measure the efficiency of the set reconciliation algorithm.
+Below are instructions on how to run the evaluation.
+
+_Note that all of these should be run from the crate's root directory!_
+
+## Overhead Evaluation
+
+The overhead evaluation measures how many coded symbols are required to successfully decode set differences of various sizes.
+The key metric is the **overhead multiplier**: the ratio of coded symbols needed to the actual diff size.
+
+See [overhead results](evaluation/overhead.md) for the predefined configurations.
+
+### Running the Evaluation
+
+1. Ensure R (with the required packages), ImageMagick, and Ghostscript are installed:
+
+   - Install R from [download](https://cran.r-project.org/) or using your platform's package manager.
+   - Start `R` and install packages by executing `install.packages(c('dplyr', 'ggplot2', 'readr', 'stringr', 'scales'))`.
+   - Install ImageMagick using the official [instructions](https://imagemagick.org/script/download.php).
+   - Install Ghostscript for PDF conversion: `brew install ghostscript` (macOS) or `apt install ghostscript` (Linux).
+
+2. Run the benchmark tool directly to see available options:
+
+       cargo run --release --features bin --bin hriblt-bench -- --help
+
+   Example: run 10000 trials with diff sizes from 1 to 100:
+
+       cargo run --release --features bin --bin hriblt-bench -- \
+           --trials 10000 \
+           --set-size 1000 \
+           --diff-size '1..101' \
+           --diff-mode incremental \
+           --tsv
+
+3. To generate a plot from TSV output:
+
+       cargo run --release --features bin --bin hriblt-bench -- \
+           --trials 10000 --diff-size '1..101' --diff-mode incremental --tsv > overhead.tsv
+
+       evaluation/plot-overhead.r overhead.tsv overhead.pdf
+
+### TSV Output Format
+
+When using `--tsv`, the benchmark outputs tab-separated data with the following columns:
+
+| Column | Description |
+|--------|-------------|
+| `trial` | Trial number (1-indexed) |
+| `set_size` | Number of elements in each set |
+| `diff_size` | Number of differences between sets |
+| `success` | Whether decoding succeeded (`true`/`false`) |
+| `coded_symbols` | Number of coded symbols needed to decode |
+| `overhead` | Ratio of coded_symbols to diff_size |
+
+### Regenerating Documentation Plots
+
+To regenerate the plots in the repository, ensure ImageMagick is available, and run:
+
+    scripts/generate-overhead-plots
+
+_Note that this will always re-run the benchmark!_
+
+## Understanding the Results
+
+The overhead plot shows percentiles of the overhead multiplier across different diff sizes:
+
+- **Gray region**: Full range (p0 to p99) - min to ~max overhead observed
+- **Blue region**: Interquartile range (p25 to p75) - where 50% of trials fall
+- **Dark line**: Median (p50) - typical overhead
+
+A lower overhead means more efficient encoding. An overhead of 1.0x would mean the number of coded symbols exactly equals the diff size (theoretical minimum).
+
+Typical results show:
+- Small diff sizes (1-10) have higher variance and overhead due to the probabilistic nature of the algorithm
+- Larger diff sizes (50+) converge to a more stable overhead around 1.3-1.5x
+- The algorithm successfully decodes 100% of trials when given up to 10x the diff size in coded symbols
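
The percentile bands described in the README can be approximated from the TSV output with a short script. The sketch below is not the repository's `plot-overhead.r`; it is a minimal Python illustration that assumes the TSV starts with a header row naming the documented columns and uses linear-interpolation percentiles, which may differ from the method the R script uses. The names `percentile` and `overhead_percentiles` are hypothetical.

```python
import csv
import io
import math
from collections import defaultdict


def percentile(sorted_values, p):
    """Linear-interpolation percentile of an ascending list, with p in [0, 100]."""
    if not sorted_values:
        raise ValueError("need at least one value")
    k = (len(sorted_values) - 1) * p / 100
    lo = math.floor(k)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (k - lo)


def overhead_percentiles(tsv_text):
    """Summarise overhead per diff_size, using successful trials only."""
    by_diff = defaultdict(list)
    # Assumption: the first line is a header naming the documented columns.
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        if row["success"] == "true":
            by_diff[int(row["diff_size"])].append(float(row["overhead"]))
    summary = {}
    for diff_size in sorted(by_diff):
        values = sorted(by_diff[diff_size])
        summary[diff_size] = {
            f"p{p}": percentile(values, p) for p in (0, 25, 50, 75, 99)
        }
    return summary


if __name__ == "__main__":
    # Made-up sample rows, for illustration only (not benchmark results).
    sample = "\n".join([
        "trial\tset_size\tdiff_size\tsuccess\tcoded_symbols\toverhead",
        "1\t1000\t2\ttrue\t4\t2.0",
        "2\t1000\t2\ttrue\t6\t3.0",
        "3\t1000\t2\tfalse\t20\t10.0",
    ])
    print(overhead_percentiles(sample))
```

In the result, `p0` and `p99` bound the gray region, `p25` and `p75` bound the blue region, and `p50` is the dark median line.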
Binary file: 196 KB
Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+#!/usr/bin/env bash
+
+## Run overhead evaluation and generate the plots for the documentation.
+
+set -eu
+
+cd "$(dirname "$0")/.."
+
+plots_dir=evaluation/overhead
+tmp_dir=$(mktemp -d)
+# Single quotes defer expansion until the trap fires; the inner quotes protect paths with spaces
+trap 'rm -rf "$tmp_dir"' EXIT
+
+# Run benchmark with a good range of diff sizes and enough trials for statistical significance
+# Performs 100 trials per diff size
+cargo run --release --features bin --bin hriblt-bench -- \
+    --trials 20000 \
+    --set-size 1000 \
+    --diff-size '1..=200' \
+    --diff-mode incremental \
+    --tsv \
+    --seed 42 \
+    > "$tmp_dir/overhead.tsv"
+
+# Generate the PDF plot
+evaluation/plot-overhead.r "$tmp_dir/overhead.tsv" "$tmp_dir/overhead.pdf"
+
+# Create output directory
+rm -rf "$plots_dir"
+mkdir -p "$plots_dir"
+
+# Convert PDF to PNG for documentation (requires ghostscript for PDF support)
+# Use 'magick' for ImageMagick v7+, fall back to 'convert' for older versions
+if command -v magick &> /dev/null; then
+    magick -density 300 "$tmp_dir/overhead.pdf" -resize 1024x1024 -alpha remove -alpha off "$plots_dir/overhead.png"
+else
+    convert -density 300 "$tmp_dir/overhead.pdf" -resize 1024x1024 -alpha remove -alpha off "$plots_dir/overhead.png"
+fi
+
+echo "Generated plots in $plots_dir/"
