Skip to content

Sequential and distributed implementations of Apriori and FP-Growth algorithms using Scala and Spark.

Notifications You must be signed in to change notification settings

felipekunzler/frequent-itemset-mining-spark

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

66 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Frequent Itemset Mining in Scala and Spark

Implementation of Apriori and FP-Growth in Scala, as well as their counterpart distributed implementations YAFIM, RApriori and DFPS using Spark.

This work has been developed as part of my CS undergraduate thesis where three previously proposed distributed algorithms were implemented in Spark and compared on Amazon EMR.

Algorithms

Experiments

The three distributed FIM algorithms were compared in terms of execution time and normalized speedup. All experiments were run on Amazon EMR using m5.xlarge (dual core) machines.

  • Execution time in seconds. Datasets replicated 9 times and increasing cluster size.

  • Normalized speedup factor, if the number of cores is increased to 10, the ideal speedup factor is 10 (meaning that the execution time has decreased by 10 times). Datasets replicated 9 times and increasing cluster size (X axis measured in cores).

Building and running

  • Install and set Java 8 to your path, e.g. sdk install java 8.0.252.hs-adpt and sdk use java 8.0.252.hs-adpt
  • Instsall sbt, e.g. sdk install sbt
  • Compile the project using sbt compile
  • Adjust the absolute path for the provided datasets in src/main/resources/defaultfim.properties
  • Run the experiments with sbt run and select the main class experiments.Runner

Sample output (execution times in seconds):

+-----------------+------------+-------+-------+-------+-------+------+
|           Class |    Dataset | Run 1 | Run 2 | Run 3 |  Mean |   SD |
+-----------------+------------+-------+-------+-------+-------+------+
| AprioriHashTree |   mushroom | 14.90 | 13.96 | 14.65 | 14.50 | 0.40 |
|        FPGrowth |   mushroom |  0.63 |  0.55 |  0.64 |  0.61 | 0.04 |
|   YAFIMHashTree |   mushroom |  7.26 |  6.64 |  7.92 |  7.27 | 0.52 |
|        RApriori |   mushroom |  6.71 |  6.79 |  7.45 |  6.98 | 0.33 |
|            DFPS |   mushroom |  2.49 |  2.31 |  2.27 |  2.36 | 0.10 |
|                 |            |       |       |       |       |      |
| AprioriHashTree | T10I4D100K |  3.68 |  2.39 |  2.42 |  2.83 | 0.60 |
|        FPGrowth | T10I4D100K |  3.26 |  3.27 |  3.06 |  3.20 | 0.10 |
|   YAFIMHashTree | T10I4D100K |  1.96 |  1.90 |  1.89 |  1.92 | 0.03 |
|        RApriori | T10I4D100K |  1.89 |  1.80 |  1.75 |  1.81 | 0.06 |
|            DFPS | T10I4D100K |  4.13 |  4.07 |  4.14 |  4.11 | 0.03 |
+-----------------+------------+-------+-------+-------+-------+------+

+------------+-----------------+----------+---------------+----------+------+
|            | AprioriHashTree | FPGrowth | YAFIMHashTree | RApriori | DFPS |
+------------+-----------------+----------+---------------+----------+------+
|   mushroom |           14.50 |     0.61 |          7.27 |     6.98 | 2.36 |
| T10I4D100K |            2.83 |     3.20 |          1.92 |     1.81 | 4.11 |
+------------+-----------------+----------+---------------+----------+------+

Additionally, algorithms can be run individually by instantiating its respective class. Example:

object Main {
  def main(args: Array[String]): Unit = {
    
    val dfps: SparkFIM = new DFPS()
    
    val fileName = "/projects/spark-scala/frequent-itemset-mining-spark/src/main/resources/datasets/mushroom.txt"
    val frequentItemsets = dfps.execute(fileName, separator = " ", minSupport = 0.65)
    
    Util.printItemsets(frequentItemsets)
    
  }
}

Would output:

Elapsed time: 2.06 seconds. Class: DFPS.
Found 39 itemsets
[1] {34}, {36}, {39}, {85}, {86}, {90}
[2] {34, 36}, {34, 39}, {34, 85}, {34, 86}, {34, 90}, {36, 85}, {36, 86}, {36, 90}, {39, 85}, {39, 86}, {85, 86}, {85, 90}, {86, 90}
[3] {34, 36, 85}, {34, 36, 86}, {34, 36, 90}, {34, 39, 85}, {34, 39, 86}, {34, 85, 86}, {34, 85, 90}, {34, 86, 90}, {36, 85, 86}, {36, 85, 90}, {36, 86, 90}, {39, 85, 86}, {85, 86, 90}
[4] {34, 36, 85, 86}, {34, 36, 85, 90}, {34, 36, 86, 90}, {34, 39, 85, 86}, {34, 85, 86, 90}, {36, 85, 86, 90}
[5] {34, 36, 85, 86, 90}

About

Sequential and distributed implementations of Apriori and FP-Growth algorithms using Scala and Spark.

Topics

Resources

Stars

Watchers

Forks

Packages

No packages published

Languages