# Evaluation

The crate includes an overhead evaluation to measure the efficiency of the set reconciliation algorithm.
Below are instructions on how to run the evaluation.

_Note that all of these commands should be run from the crate's root directory!_

## Overhead Evaluation

The overhead evaluation measures how many coded symbols are required to successfully decode set differences of various sizes.
The key metric is the **overhead multiplier**: the ratio of coded symbols needed to the actual diff size.
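The multiplier is a plain ratio. As a quick illustration (the numbers below are made up for the example, not benchmark output):

```python
# Illustrative only: suppose decoding a diff of 40 elements needed 54 coded symbols.
diff_size = 40
coded_symbols = 54
overhead = coded_symbols / diff_size
print(f"overhead = {overhead:.2f}x")  # overhead = 1.35x
```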

See [overhead results](evaluation/overhead.md) for the predefined configurations.

### Running the Evaluation

1. Ensure R and ImageMagick are installed with the necessary packages:

   - Install R from [CRAN](https://cran.r-project.org/) or using your platform's package manager.
   - Start `R` and install the required packages by executing `install.packages(c('dplyr', 'ggplot2', 'readr', 'stringr', 'scales'))`.
   - Install ImageMagick using the official [instructions](https://imagemagick.org/script/download.php).
   - Install Ghostscript for PDF conversion: `brew install ghostscript` (macOS) or `apt install ghostscript` (Linux).

2. Run the benchmark tool directly to see the available options:

        cargo run --release --features bin --bin hriblt-bench -- --help

   Example: run 10,000 trials with a set size of 1000 and diff sizes from 1 to 100:

        cargo run --release --features bin --bin hriblt-bench -- \
          --trials 10000 \
          --set-size 1000 \
          --diff-size '1..101' \
          --diff-mode incremental \
          --tsv

3. To generate a plot from TSV output:

        cargo run --release --features bin --bin hriblt-bench -- \
          --trials 10000 --diff-size '1..101' --diff-mode incremental --tsv > overhead.tsv

        evaluation/plot-overhead.r overhead.tsv overhead.pdf

   If the script is not marked executable on your system, invoke it via `Rscript evaluation/plot-overhead.r overhead.tsv overhead.pdf` instead.

### TSV Output Format

When using `--tsv`, the benchmark outputs tab-separated data with the following columns:

| Column | Description |
|--------|-------------|
| `trial` | Trial number (1-indexed) |
| `set_size` | Number of elements in each set |
| `diff_size` | Number of differences between sets |
| `success` | Whether decoding succeeded (`true`/`false`) |
| `coded_symbols` | Number of coded symbols needed to decode |
| `overhead` | Ratio of `coded_symbols` to `diff_size` |

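For a quick look at the data without R, a minimal Python sketch can summarize the median overhead per diff size. It assumes a TSV file with the columns above (the file name `overhead.tsv` matches the example run earlier, but any path works):

```python
import csv
from statistics import median

def median_overhead_by_diff_size(path):
    """Group the `overhead` column by `diff_size` and return the median per size.

    Assumes a tab-separated file with a header row using the column
    names described above.
    """
    by_diff = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            by_diff.setdefault(int(row["diff_size"]), []).append(float(row["overhead"]))
    return {size: median(values) for size, values in sorted(by_diff.items())}

# Example usage (assuming a benchmark run wrote overhead.tsv):
# for size, med in median_overhead_by_diff_size("overhead.tsv").items():
#     print(f"diff_size={size}  median overhead={med:.2f}x")
```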
### Regenerating Documentation Plots

To regenerate the plots in the repository, ensure ImageMagick is available, and run:

    scripts/generate-overhead-plots

_Note that this will always re-run the benchmark!_

## Understanding the Results

The overhead plot shows percentiles of the overhead multiplier across different diff sizes:

- **Gray region**: full range (p0 to p99), from the minimum to roughly the maximum overhead observed
- **Blue region**: interquartile range (p25 to p75), where 50% of trials fall
- **Dark line**: median (p50), the typical overhead
| 72 | + |
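These bands can be recomputed from the TSV output with Python's `statistics` module. A minimal sketch, assuming the overhead values for a single diff size have already been collected into a list (the values below are illustrative, not real benchmark data):

```python
from statistics import median, quantiles

# Illustrative overhead values for one diff size (not real benchmark data).
overheads = [1.2, 1.3, 1.3, 1.4, 1.5, 1.7, 2.0, 2.5]

# quantiles(..., n=100) returns the 99 cut points p1 through p99.
cuts = quantiles(overheads, n=100)
p25, p75, p99 = cuts[24], cuts[74], cuts[98]
p50 = median(overheads)
print(f"median {p50:.2f}x, IQR {p25:.2f}x-{p75:.2f}x, p99 {p99:.2f}x")
```

Note that with small samples `quantiles` interpolates, so extreme percentiles such as p99 may fall outside the observed values.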
A lower overhead means more efficient encoding. An overhead of 1.0x would mean the number of coded symbols exactly equals the diff size, which is the theoretical minimum.
| 74 | + |
| 75 | +Typical results show: |
| 76 | +- Small diff sizes (1-10) have higher variance and overhead due to the probabilistic nature of the algorithm |
| 77 | +- Larger diff sizes (50+) converge to a more stable overhead around 1.3-1.5x |
| 78 | +- The algorithm successfully decodes 100% of trials when given up to 10x the diff size in coded symbols |