sketch is a generic, header-only library providing implementations of a variety of sketch data structures for scalable and streaming applications.
All have been accelerated with SIMD parallelism where possible, most are composable, and many are threadsafe unless -DNOT_THREADSAFE is passed as a compilation flag.
Documentation for the Python interface is available here.
We directly include blaze-lib, libpopcnt, compact_vector, ska::flat_hash_map, and xxHash for various utilities. We also have two submodules:
- pybind11, only used for python bindings.
- SLEEF for vectorized math, incorporated with vec.h. It's optionally used (disabled by defining
-DNO_SLEEF=1/#define NO_SLEEF 1) and only applicable to using rnla.h through blaze-lib.
You can ignore both for most use cases.
- HyperLogLog Implementation [hll.h]
hll_t/hllbase_t<HashStruct>- Estimates the cardinality of a set using log(log(cardinality)) bits.
- Threadsafe unless
-DNOT_THREADSAFEis passed. - Currently,
hllis the only structure for which python bindings are available, but we intend to extend this in the future.
- HyperBitBit [hbb.h]
- Better per-bit accuracy than HyperLogLogs, but, at least currently, limited to 128 bits/16 bytes in sketch size.
- Bloom Filter [bf.h]
bf_t/bfbase_t<HashStruct>- Naive bloom filter
- Currently not threadsafe.
- Count-Min and Count Sketches
- ccm.h (
ccmbase_t<UpdatePolicy=Increment>/ccm_t(usepccm_tfor Approximate Counting orcs_tfor a count sketch). - The Count sketch is threadsafe if
-DNOT_THREADSAFEis not passed or if an atomic container is used. Count-Min sketches are currently not threadsafe due to the use of minimal updates. - Count-min sketches can support concept drift if
realccm_tfrom mult.h is used.
- ccm.h (
- MinHash sketches
- mh.h (
RangeMinHashis the currently verified implementation.) We recommend you build the sketch and then convert to a linear container (e.g., astd::vector) usingto_container<ContainerType>()or.finalize()for faster comparisons.- BottomKHasher is an alternate that uses more space to reduce runtime, which finalizes() into the same structure.
- CountingRangeMinHash performs the same operations as RangeMinHash, but provides multiplicities, which facilitates
histogram_similarity, a generalization of Jaccard with multiplicities. - Both CountingRangeMinHash and RangeMinHash can be finalized into containers for fast comparisons with
.finalize(). - A draft HyperMinHash implementation is available as well, but it has not been thoroughly vetted.
- Range MinHash implementationsare not threadsafe.
- HyperMinHash implementation is threa
- mh.h (
- B-Bit MinHash
- bbmh.h
- One-permutation (partition) bbit minhash
- Threadsafe, bit-packed and fully SIMD-accelerated
- Power of two partitions are supported in BBitMinHasher, which is finalized into a FinalBBitMinHash sketch. This is faster than the alternative.
- We also support arbitrary divisions using fastmod64 with DivBBitMinHasher and its corresponding final sketch, FinalDivBBitMinHash.
- One-permutation counting bbit minhash
- Not threadsafe.
- ModHash sketches
- mod.h
- Estimates both containment and jaccard index, but takes a (potentially) unbounded space.
- This returns a FinalRMinHash sketch, reusing the infrastructure for minhash sketches, but which calculates Jaccard index and containment knowing that it was generated via mod, not min.
- HeavyKeeper
- hk.h
- Reference: https://www.usenix.org/conference/atc18/presentation/gong
- A seemingly unilateral improvement over count-min sketches.
- One drawback is the inability to delete items, which makes it unsuitable for sliding windows.
- It shares this characteristic with the Count-Min sketch with conservative update and the Count-Min Mean sketch.
- ntcard
- mult.h
- Threadsafe
- Reference: https://www.ncbi.nlm.nih.gov/pubmed/28453674
- Not SIMD-accelerated, but also general, supporting any arbitrary coverage level
- PCSA
- pc.h
- The HLL is more performant and better-optimized, but this is included for completeness.
- Not threadsafe.
- Reference: https://dl.acm.org/doi/10.1016/0022-0000%2885%2990041-8 Journal of Computer and System Sciences. September 1985 https://doi.org/10.1016/0022-0000(85)90041-8
- SetSketch
- See setsketch.h for continuous and discretized versions of the SetSketch.
- This also includes parameter-setting code.
To build and run the hll test case:
make test && ./testTo use as a header-only library:
using namespace sketch;
hll::hll_t hll(14); // Use 2**14 bytes for this structure
// Add hashed values for each element to the structure.
for(uint64_t i(0); i < 10000000ull; ++i) hll.addh(i);
fprintf(stderr, "Elements estimated: %lf. Error bounds: %lf.\n", hll.report(), hll.est_err());The other structures work with a similar interface. See the type constructors for more information or view 10xdash for examples on using the same interface for a variety of data structures.
Simply #include sketch/<header_name>, or, for one include #include <sketch/sketch.h>,
which allows you to write sketch::bf_t and sketch::hll_t without the subnamespaces.
We use inline namespaces for individual types of sketches, e.g., sketch::minhash or sketch::hll can be used for clarity, or sketch::hll_t can be used, omitting the hll namespace.
Clang on OSX may fail to compile in AVX512 mode. We recommend using homebrew's gcc:
homebrew upgrade gcc || homebrew install gcc
and either setting the environmental variables for CXX and CC to the most recent g++/gcc or providing them as Makefile arguments.
At the time of writing, this is g++-10 and gcc-10.
By default, updates to the hyperloglog structure to occur using atomic operations, though threading should be handled by the calling code. Otherwise, the flag -DNOT_THREADSAFE should be passed. The cost of this is relatively minor, but in single-threaded situations, this would be preferred.
Python bindings are available via pybind11. Simply cd python && python setup.py install.
The package has been renamed to sketch_ds as of v0.19
Utilities include: 1. Sketching/serialization for sketch data structures 1. Supported: sketch_ds.bbmh.BBitMinHasher, sketch_ds.bf.bf, sketch_ds.hmh.hmh, sketch_ds.hll.hll 2. shs_isz, which computes the intersection size of sorted hash sets. 1. Supported: {uint,int}{32,64}, float32, float64 3. fastmod/fastdiv, which uses the fast modulo reduction to do faster division/mod than numpy. 1. Supportd: {uint,int}{32,64} 4. matrix generation functions - taking a list of sketches and creating the similarity function matrix. 1. Supported: sketch_ds.bbmh.BBitMinHasher, sketch_ds.bf.bf, sketch_ds.hmh.hmh, sketch_ds.hll.hll 2. Types: "jaccard_matrix", "intersection_matrix", "containment_matrix", "union_size_matrix", "symmetric_containment_matrix" 3. Returns a compressed distance matrix. 5. ccount_eq, pcount_eq compute the number of identical registers between integral registers. 1. Inspired by cdist and pdist from scipy.spatial.distance 2. ccount_eq computes the number of identical registers between all pairs of rows between two matrices A and B. 1. Size of returned matrix: (A.shape[0], A.shape[1]) 3. pcount_eq computes the number of identical registers between all pairs of rows in a single matrix A. 1. Size of returned matrix: (A.shape[0] * (A.shape[0]) - 1) / 2 4. pcount_eq output can be transformed from similarities to distances via -np.log(distmat / A.shape[1]).