GitHub - itspawanbhardwaj/spark-fuzzy-matching: Fuzzy matching function in spark (https://spark-packages.org/package/itspawanbhardwaj/spark-fuzzy-matching)

Maven Central

For Scala 2.10

<dependency>
  <groupId>com.github.itspawanbhardwaj</groupId>
  <artifactId>spark-fuzzy-matching_2.10</artifactId>
  <version>1.0.0</version>
</dependency>

For Scala 2.11

<dependency>
  <groupId>com.github.itspawanbhardwaj</groupId>
  <artifactId>spark-fuzzy-matching_2.11</artifactId>
  <version>1.0.1</version>
</dependency>
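
For sbt builds, the same coordinates can presumably be declared as below. This is a direct translation of the Maven snippets above, not something taken from the project's own documentation, so treat it as an assumption and keep only the line that matches your Scala version.

```scala
// build.sbt (sketch): keep exactly one of the two lines below.

// Scala 2.10
libraryDependencies += "com.github.itspawanbhardwaj" % "spark-fuzzy-matching_2.10" % "1.0.0"

// Scala 2.11
// libraryDependencies += "com.github.itspawanbhardwaj" % "spark-fuzzy-matching_2.11" % "1.0.1"
```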

Metrics and algorithms

Dice / Sorensen (Similarity metric)
Double Metaphone (Phonetic metric and algorithm)
Hamming (Similarity metric)
Jaccard (Similarity metric)
Jaro (Similarity metric)
Jaro-Winkler (Similarity metric)
Levenshtein (Similarity metric)
Metaphone (Phonetic metric and algorithm)
Monge-Elkan (Similarity metric)
Match Rating Approach (Phonetic metric and algorithm)
Needleman-Wunsch (Similarity metric)
N-Gram (Similarity metric)
NYSIIS (Phonetic metric and algorithm)
Overlap (Similarity metric)
Ratcliff-Obershelp (Similarity metric)
Refined NYSIIS (Phonetic metric and algorithm)
Refined Soundex (Phonetic metric and algorithm)
Tanimoto (Similarity metric)
Tversky (Similarity metric)
Smith-Waterman (Similarity metric)
Soundex (Phonetic metric and algorithm)
Weighted Levenshtein (Similarity metric)
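
To make the idea of a similarity metric concrete, here is a minimal plain-Scala sketch of Levenshtein distance, one of the metrics listed above. It is a textbook dynamic-programming version written for illustration; `LevenshteinSketch` and `levenshtein` are hypothetical names, and this is not the implementation used by this package or by stringmetric.

```scala
object LevenshteinSketch {

  /** Minimum number of single-character insertions, deletions and
    * substitutions needed to turn `a` into `b` (case-sensitive).
    */
  def levenshtein(a: String, b: String): Int = {
    // prev(j) = distance between the first i-1 chars of a and the first j chars of b
    var prev = Array.range(0, b.length + 1)
    for (i <- 1 to a.length) {
      val curr = new Array[Int](b.length + 1)
      curr(0) = i // deleting the first i chars of a yields the empty string
      for (j <- 1 to b.length) {
        val cost = if (a(i - 1) == b(j - 1)) 0 else 1
        curr(j) = math.min(
          math.min(curr(j - 1) + 1, prev(j) + 1), // insertion, deletion
          prev(j - 1) + cost                      // substitution or match
        )
      }
      prev = curr
    }
    prev(b.length)
  }

  def main(args: Array[String]): Unit = {
    println(levenshtein("Pulp Fiction", "Pulp Ficdion"))
  }
}
```

A small distance means the two strings are close; phonetic algorithms such as Soundex instead compare encodings of how the strings sound.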

Functions

All functions are defined under com.pb.fuzzy.matching.functions.

import com.pb.fuzzy.matching.functions._ // import to use fuzzy matching functions

  
  levenshteinFn(document, document1)
  diceSorensenFn(document, document1, nGramSize)
  hammingFn(document, document1)
  jaccardFn(document, document1, nGramSize)
  jaroFn(document, document1)
  jaroWinklerFn(document, document1)
  nGramFn(document, document1, nGramSize)
  overlapFn(document, document1, nGramSize)
  ratcliffObershelpFn(document, document1)
  weightedLevenshteinFn(document, document1, deleteWeight, insertWeight, substituteWeight)
  metaphoneFn(document, document1)
  computeMetaphoneFn(document)
  nysiisFn(document, document1)
  computeNysiisFn(document)
  refinedNysiisFn(document, document1)
  computeRefinedNysiisFn(document)
  refinedSoundexFn(document, document1)
  computeRefinedSoundexFn(document)
  soundexFn(document, document1)
  computeSoundexFn(document)
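
Several of the functions above take an `nGramSize` argument. The sketch below illustrates what that parameter typically controls, using a set-based Jaccard similarity over character n-grams in plain Scala. `NGramJaccardSketch`, `nGrams` and `jaccard` are hypothetical names, and the tokenization or counting used by `jaccardFn` itself may differ.

```scala
object NGramJaccardSketch {

  /** All contiguous character n-grams of `s`, as a set.
    * Returns the empty set when `s` is shorter than `n` or `n` is not positive.
    */
  def nGrams(s: String, n: Int): Set[String] =
    if (n <= 0 || s.length < n) Set.empty[String] else s.sliding(n).toSet

  /** |A intersect B| / |A union B| over the n-gram sets of the two strings.
    * None when either string yields no n-grams.
    */
  def jaccard(a: String, b: String, n: Int): Option[Double] = {
    val ga = nGrams(a, n)
    val gb = nGrams(b, n)
    if (ga.isEmpty || gb.isEmpty) None
    else Some(ga.intersect(gb).size.toDouble / ga.union(gb).size)
  }

  def main(args: Array[String]): Unit = {
    // With n = 2, "night" and "nacht" share only the bigram "ht".
    println(jaccard("night", "nacht", 2))
  }
}
```

Larger values of `n` make the comparison stricter, because longer substrings have to match exactly before they count as shared.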

Example

The project contains a FuzzyMatchingJoinExample which works as follows:

Dataset with proper names
+--------------------+--------------------+-------+
|               title|               gener|ratings|
+--------------------+--------------------+-------+
|The Shawshank Red...|        Crime. Drama|    9.3|
|       The Godfather|        Crime. Drama|    9.2|
|     The Dark Knight|Action. Crime. Drama|    9.0|
|The Godfather: Pa...|        Crime. Drama|    9.0|
|        Pulp Fiction|        Crime. Drama|    8.9|
+--------------------+--------------------+-------+
only showing top 5 rows

Dataset with misspelled names
+--------------------+----+--------+
|               title|year|duration|
+--------------------+----+--------+
|dhe Shwshnk Redem...|1994|     142|
|        dhe Godfdher|1972|     175|
|      dhe Drk Knighd|2008|     152|
|dhe Godfdher: Prd II|1974|     202|
|        Pulp Ficdion|1994|     154|
+--------------------+----+--------+
only showing top 5 rows

Dataset after fuzzy join
+--------------------+--------------------+-------+--------------------+----+--------+
|               title|               gener|ratings|               title|year|duration|
+--------------------+--------------------+-------+--------------------+----+--------+
|The Shawshank Red...|        Crime. Drama|    9.3|dhe Shwshnk Redem...|1994|     142|
|       The Godfather|        Crime. Drama|    9.2|        dhe Godfdher|1972|     175|
|     The Dark Knight|Action. Crime. Drama|    9.0|      dhe Drk Knighd|2008|     152|
|        Pulp Fiction|        Crime. Drama|    8.9|        Pulp Ficdion|1994|     154|
|    Schindler's List|Biography. Drama....|    8.9|    Schindler's Lisd|1993|     195|
+--------------------+--------------------+-------+--------------------+----+--------+
only showing top 5 rows
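
The join logic shown in the tables above can be sketched without Spark, using plain Scala collections. This is only an illustration of the idea: the code of `FuzzyMatchingJoinExample` is not reproduced here, and the case classes, the `fuzzyJoin` helper and the distance threshold of 3 are all assumptions made for this sketch. The sample rows are taken from the tables above.

```scala
object FuzzyJoinSketch {
  // Hypothetical row types mirroring the two example datasets.
  final case class Rated(title: String, genre: String, rating: Double)
  final case class Timed(title: String, year: Int, duration: Int)

  /** Textbook Levenshtein distance (case-sensitive). */
  def levenshtein(a: String, b: String): Int = {
    var prev = Array.range(0, b.length + 1)
    for (i <- 1 to a.length) {
      val curr = new Array[Int](b.length + 1)
      curr(0) = i
      for (j <- 1 to b.length) {
        val cost = if (a(i - 1) == b(j - 1)) 0 else 1
        curr(j) = math.min(math.min(curr(j - 1) + 1, prev(j) + 1), prev(j - 1) + cost)
      }
      prev = curr
    }
    prev(b.length)
  }

  /** Pairs each left row with its closest right row by title,
    * keeping the pair only if the distance is at most `maxDistance`.
    */
  def fuzzyJoin(left: Seq[Rated], right: Seq[Timed], maxDistance: Int): Seq[(Rated, Timed)] =
    left.flatMap { l =>
      right
        .map(r => (r, levenshtein(l.title, r.title)))
        .filter(_._2 <= maxDistance)
        .sortBy(_._2)
        .headOption
        .map { case (r, _) => (l, r) }
    }

  val proper: Seq[Rated] = Seq(
    Rated("The Godfather", "Crime. Drama", 9.2),
    Rated("The Dark Knight", "Action. Crime. Drama", 9.0),
    Rated("Pulp Fiction", "Crime. Drama", 8.9)
  )

  val misspelled: Seq[Timed] = Seq(
    Timed("dhe Godfdher", 1972, 175),
    Timed("dhe Drk Knighd", 2008, 152),
    Timed("Pulp Ficdion", 1994, 154)
  )

  def main(args: Array[String]): Unit =
    fuzzyJoin(proper, misspelled, maxDistance = 3).foreach { case (l, r) =>
      println(s"${l.title} | ${l.rating} | ${r.title} | ${r.year} | ${r.duration}")
    }
}
```

Comparing every left row with every right row is quadratic in the number of rows, so on large datasets a real join would usually need some form of blocking or pre-filtering before the metric is applied.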

Library used

stringmetric: string metrics and phonetic algorithms for Scala (e.g. Dice/Sorensen, Hamming, Jaccard, Jaro, Jaro-Winkler, Levenshtein, Metaphone, N-Gram, NYSIIS, Overlap, Ratcliff/Obershelp, Refined NYSIIS, Refined Soundex, Soundex, Weighted Levenshtein).
