italo-batista/lsh-semantic-similarity

Locality Sensitive Hashing for semantic similarity


LSH (Locality Sensitive Hashing) is primarily used to find the near-duplicates within a large set of documents. It can use Hamming distance, the Jaccard coefficient, edit distance, or another notion of distance.
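As a quick illustration of one of those distance notions (this helper is not part of the repository), the Jaccard coefficient of two token sets is the size of their intersection divided by the size of their union:

```python
def jaccard(a, b):
    """Jaccard coefficient: |A ∩ B| / |A ∪ B| over two token sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Two near-duplicate sentences share most of their tokens.
doc1 = "the quick brown fox jumps".split()
doc2 = "the quick brown fox leaps".split()
print(jaccard(doc1, doc2))  # 4 shared tokens out of 6 distinct -> 0.666...
```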

You can read introductory LSH tutorials if you want to understand more about it.

Although LSH is better suited to finding duplicated documents than semantically similar ones, this approach makes an effort to use LSH to calculate semantic similarity among texts. To do that, the algorithm extracts the text's main tokens using TF-IDF (or you can pre-calculate them and pass them as a parameter). This approach also uses MinHash (which estimates Jaccard similarity) as the similarity function.
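The TF-IDF step can be sketched roughly as follows. This is an illustrative stand-alone sketch, not the repository's actual implementation: the function name, scoring formula, and tokenization (plain whitespace splitting) are all assumptions.

```python
import math
from collections import Counter

def top_tfidf_tokens(docs, k=3):
    """Sketch: pick each document's k highest-scoring TF-IDF tokens.

    Illustrative only; the repository's actual extraction may differ.
    """
    n = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for doc in docs for tok in set(doc.split()))
    result = []
    for doc in docs:
        tf = Counter(doc.split())  # term frequency within this document
        scores = {t: tf[t] * math.log(n / df[t]) for t in tf}
        top = sorted(scores, key=scores.get, reverse=True)[:k]
        result.append(top)
    return result

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs are animals"]
print(top_tfidf_tokens(docs, k=3))
```

Tokens that appear in every document (like "the" or "sat" here) get a low IDF weight, so the distinctive tokens of each text rise to the top.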

The overall aim is to reduce the number of comparisons needed to find similar items. LSH uses hash collisions to capture object similarity. The collisions come in handy here because similar documents have a high probability of receiving the same hash value: the probability of a hash collision for a MinHash is exactly the Jaccard similarity of the two sets.

See this tutorial to learn how to use this LSH!

Run the following to install dependencies:

  python3 -m pip install -r requirements.txt
