MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++

RJMillerLab/datasketch

 
 

datasketch: Big Data Looks Small

datasketch gives you probabilistic data structures that can process and search very large amounts of data quickly, with little loss of accuracy.

This package contains the following data sketches:

Data Sketch         Usage
MinHash             estimate Jaccard similarity and cardinality
Weighted MinHash    estimate weighted Jaccard similarity
HyperLogLog         estimate cardinality
HyperLogLog++       estimate cardinality
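
To illustrate what "estimate Jaccard similarity" means for MinHash, here is a minimal sketch that uses only the Python 3 standard library. It is not datasketch's implementation or API: the function names (`minhash_signature`, `estimate_jaccard`, `exact_jaccard`) and the use of seeded SHA-1 hashes in place of random permutations are assumptions made for this example.

```python
import hashlib


def _seeded_hash(token, seed):
    # 64-bit hash of a token salted with a seed. Each seed stands in for
    # one random permutation of the token universe (an assumption of this
    # sketch, not how datasketch generates its permutations).
    data = ("%d:%s" % (seed, token)).encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def minhash_signature(tokens, num_perm=128):
    # For every seeded hash function keep only the minimum hash value.
    # Assumes a non-empty collection of tokens.
    tokens = set(tokens)
    return [min(_seeded_hash(t, seed) for t in tokens)
            for seed in range(num_perm)]


def estimate_jaccard(sig_a, sig_b):
    # The fraction of positions where two signatures agree estimates
    # the Jaccard similarity of the underlying sets.
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def exact_jaccard(a, b):
    # Exact Jaccard similarity, for comparison with the estimate.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


set1 = ("minhash is a probabilistic data structure for estimating "
        "the similarity between datasets").split()
set2 = ("minhash is a probability data structure for estimating "
        "the similarity between documents").split()

sig1 = minhash_signature(set1)
sig2 = minhash_signature(set2)
print("estimated:", estimate_jaccard(sig1, sig2))
print("exact:    ", exact_jaccard(set1, set2))
```

The signature has a fixed length (`num_perm` values) no matter how large the input set is, which is why sketches of this kind stay small for large data. More hash functions give a tighter estimate at the cost of a longer signature.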

The following indexes for data sketches are provided to support sub-linear query time:

Index                 For Data Sketch              Supported Query Type
MinHash LSH           MinHash, Weighted MinHash    Radius (Threshold)
MinHash LSH Forest    MinHash, Weighted MinHash    Top-K
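
The idea behind a sub-linear threshold query can be sketched with the classic banding form of locality-sensitive hashing over MinHash signatures. The code below is a standard-library illustration under stated assumptions, not datasketch's index or API: the class `BandedLSH`, its `bands` and `rows` parameters, and the helper `minhash_signature` are hypothetical names for this example.

```python
import hashlib
from collections import defaultdict


def minhash_signature(tokens, num_perm=128):
    # Seeded SHA-1 hashes stand in for random permutations (an assumption
    # of this sketch). Assumes a non-empty collection of tokens.
    def h(token, seed):
        data = ("%d:%s" % (seed, token)).encode("utf-8")
        return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")
    tokens = set(tokens)
    return [min(h(t, seed) for t in tokens) for seed in range(num_perm)]


class BandedLSH:
    """Hypothetical banding index: signatures that agree on every row of
    at least one band land in the same bucket and become candidates."""

    def __init__(self, bands=32, rows=4):
        self.bands = bands
        self.rows = rows
        # One hash table per band, mapping a band's values to stored keys.
        self.tables = [defaultdict(set) for _ in range(bands)]

    def _band_keys(self, signature):
        if len(signature) != self.bands * self.rows:
            raise ValueError("signature length must equal bands * rows")
        for i in range(self.bands):
            start = i * self.rows
            yield i, tuple(signature[start:start + self.rows])

    def insert(self, key, signature):
        for i, band in self._band_keys(signature):
            self.tables[i][band].add(key)

    def query(self, signature):
        # Only the buckets the query falls into are inspected, so the
        # cost does not grow with a scan over every stored signature.
        candidates = set()
        for i, band in self._band_keys(signature):
            candidates |= self.tables[i].get(band, set())
        return candidates


docs = {
    "doc_a": "minhash lsh finds similar sets without comparing all pairs".split(),
    "doc_b": "hyperloglog counts distinct elements using tiny memory".split(),
}

index = BandedLSH(bands=32, rows=4)
for key, tokens in docs.items():
    index.insert(key, minhash_signature(tokens))

query_tokens = "minhash lsh finds similar sets without comparing all pairs".split()
result = index.query(minhash_signature(query_tokens))
print(sorted(result))
```

With `bands` bands of `rows` rows each, a pair with Jaccard similarity s becomes a candidate with probability 1 - (1 - s^rows)^bands, so the two parameters together set an approximate similarity threshold. Candidates can still include false positives and miss true matches, which is the accuracy traded for query speed.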

datasketch requires Python 2.7 or above and NumPy 1.11 or above. SciPy is optional, but it can make LSH initialization much faster.

Install

To install datasketch using pip:

pip install datasketch -U

This will also install NumPy as a dependency.
