<dependency>
    <groupId>com.github.itspawanbhardwaj</groupId>
    <artifactId>spark-fuzzy-matching_2.10</artifactId>
    <version>1.0.0</version>
</dependency>
<dependency>
    <groupId>com.github.itspawanbhardwaj</groupId>
    <artifactId>spark-fuzzy-matching_2.11</artifactId>
    <version>1.0.1</version>
</dependency>
- Dice / Sorensen (Similarity metric)
- Double Metaphone (Phonetic metric and algorithm)
- Hamming (Similarity metric)
- Jaccard (Similarity metric)
- Jaro (Similarity metric)
- Jaro-Winkler (Similarity metric)
- Levenshtein (Similarity metric)
- Metaphone (Phonetic metric and algorithm)
- Monge-Elkan (Similarity metric)
- Match Rating Approach (Phonetic metric and algorithm)
- Needleman-Wunsch (Similarity metric)
- N-Gram (Similarity metric)
- NYSIIS (Phonetic metric and algorithm)
- Overlap (Similarity metric)
- Ratcliff-Obershelp (Similarity metric)
- Refined NYSIIS (Phonetic metric and algorithm)
- Refined Soundex (Phonetic metric and algorithm)
- Tanimoto (Similarity metric)
- Tversky (Similarity metric)
- Smith-Waterman (Similarity metric)
- Soundex (Phonetic metric and algorithm)
- Weighted Levenshtein (Similarity metric)
- All functions are defined under com.pb.fuzzy.matching.functions.
import com.pb.fuzzy.matching.functions._ // import to use fuzzy matching functions
levenshteinFn(document, document1)
diceSorensenFn(document, document1, nGramSize)
hammingFn(document, document1)
jaccardFn(document, document1, nGramSize)
jaroFn(document, document1)
jaroWinklerFn(document, document1)
nGramFn(document, document1, nGramSize)
overlapFn(document, document1, nGramSize)
ratcliffObershelpFn(document, document1)
weightedLevenshteinFn(document, document1, deleteWeight, insertWeight, substituteWeight)
metaphoneFn(document, document1)
computeMetaphoneFn(document)
nysiisFn(document, document1)
computeNysiisFn(document)
refinedNysiisFn(document, document1)
computeRefinedNysiisFn(document)
refinedSoundexFn(document, document1)
computeRefinedSoundexFn(document)
soundexFn(document, document1)
computeSoundexFn(document)
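Several of the functions above take an nGramSize argument. As a rough illustration of what that parameter controls, here is a minimal plain-Scala sketch of a Jaccard similarity over character n-grams. It is not the library's implementation, and the object and method names (NGramJaccardSketch, nGrams, jaccard) are hypothetical; the library's own functions may treat edge cases differently.

```scala
// Illustrative only: NOT the library's implementation.
// All names in this object are hypothetical.
object NGramJaccardSketch {
  // Split a string into its set of overlapping character n-grams.
  // Strings shorter than n yield no n-grams in this sketch.
  def nGrams(s: String, n: Int): Set[String] =
    if (n <= 0 || s.length < n) Set.empty[String]
    else s.sliding(n).toSet

  // Jaccard similarity |A ∩ B| / |A ∪ B|, in the range [0.0, 1.0].
  // Returns 0.0 when neither string produces any n-gram.
  def jaccard(a: String, b: String, n: Int): Double = {
    val x = nGrams(a, n)
    val y = nGrams(b, n)
    val union = x union y
    if (union.isEmpty) 0.0
    else (x intersect y).size.toDouble / union.size
  }
}
```

With n = 2, "night" and "nacht" share only the bigram "ht" out of seven distinct bigrams, so the score is 1/7. A larger n makes the comparison stricter, because longer runs of characters have to agree.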
The project contains a FuzzyMatchingJoinExample which works as follows:
Dataset with proper names
+--------------------+--------------------+-------+
|               title|               gener|ratings|
+--------------------+--------------------+-------+
|The Shawshank Red...|        Crime. Drama|    9.3|
|       The Godfather|        Crime. Drama|    9.2|
|     The Dark Knight|Action. Crime. Drama|    9.0|
|The Godfather: Pa...|        Crime. Drama|    9.0|
|        Pulp Fiction|        Crime. Drama|    8.9|
+--------------------+--------------------+-------+
only showing top 5 rows
Dataset with misspelled names
+--------------------+----+--------+
|               title|year|duration|
+--------------------+----+--------+
|dhe Shwshnk Redem...|1994|     142|
|        dhe Godfdher|1972|     175|
|      dhe Drk Knighd|2008|     152|
|dhe Godfdher: Prd II|1974|     202|
|        Pulp Ficdion|1994|     154|
+--------------------+----+--------+
only showing top 5 rows
Dataset after fuzzy join
+--------------------+--------------------+-------+--------------------+----+--------+
|               title|               gener|ratings|               title|year|duration|
+--------------------+--------------------+-------+--------------------+----+--------+
|The Shawshank Red...|        Crime. Drama|    9.3|dhe Shwshnk Redem...|1994|     142|
|       The Godfather|        Crime. Drama|    9.2|        dhe Godfdher|1972|     175|
|     The Dark Knight|Action. Crime. Drama|    9.0|      dhe Drk Knighd|2008|     152|
|        Pulp Fiction|        Crime. Drama|    8.9|        Pulp Ficdion|1994|     154|
|    Schindler's List|Biography. Drama....|    8.9|    Schindler's Lisd|1993|     195|
+--------------------+--------------------+-------+--------------------+----+--------+
only showing top 5 rows
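The idea behind a fuzzy join like the one above can be sketched without Spark: pair rows whose keys are within some edit distance of each other instead of requiring equality. The following is a minimal plain-Scala sketch over in-memory sequences. It is not the project's FuzzyMatchingJoinExample; the names FuzzyJoinSketch, levenshtein, and fuzzyJoin, the case-insensitive comparison, and the maxDistance threshold are all assumptions made for illustration.

```scala
// Illustrative only: NOT the project's FuzzyMatchingJoinExample.
// All names and the thresholding strategy are hypothetical.
object FuzzyJoinSketch {
  // Levenshtein edit distance via dynamic programming,
  // keeping only the previous and current rows of the table.
  def levenshtein(a: String, b: String): Int = {
    var prev: Array[Int] = (0 to b.length).toArray
    for (i <- 1 to a.length) {
      val curr = new Array[Int](b.length + 1)
      curr(0) = i
      for (j <- 1 to b.length) {
        val cost = if (a(i - 1) == b(j - 1)) 0 else 1
        val insertOrDelete = math.min(curr(j - 1) + 1, prev(j) + 1)
        curr(j) = math.min(insertOrDelete, prev(j - 1) + cost)
      }
      prev = curr
    }
    prev(b.length)
  }

  // Pair each left key with every right key that is within
  // maxDistance edits, compared case-insensitively.
  def fuzzyJoin(left: Seq[String],
                right: Seq[String],
                maxDistance: Int): Seq[(String, String)] =
    for {
      l <- left
      r <- right
      if levenshtein(l.toLowerCase, r.toLowerCase) <= maxDistance
    } yield (l, r)
}
```

For example, FuzzyJoinSketch.fuzzyJoin(Seq("The Godfather", "Pulp Fiction"), Seq("dhe Godfdher", "Pulp Ficdion"), 3) pairs each proper title with its misspelled counterpart and nothing else. This nested loop compares every left key with every right key, so it only suits small inputs. A suitable threshold depends on the data: too low misses heavily misspelled titles, too high produces false matches.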
stringmetric: String metrics and phonetic algorithms for Scala (e.g. Dice/Sorensen, Hamming, Jaccard, Jaro, Jaro-Winkler, Levenshtein, Metaphone, N-Gram, NYSIIS, Overlap, Ratcliff/Obershelp, Refined NYSIIS, Refined Soundex, Soundex, Weighted Levenshtein).