Frequent Itemset Mining in Scala and Spark

Implementation of Apriori and FP-Growth in Scala, as well as their counterpart distributed implementations YAFIM, RApriori and DFPS using Spark.

This work has been developed as part of my CS undergraduate thesis where three previously proposed distributed algorithms were implemented in Spark and compared on Amazon EMR.

Algorithms

Sequential implementation of Apriori and FP-Growth in Scala.
- Apriori.scala
- FPGrowth.scala and FPTree.scala
- Apriori using a Hash Tree for filtering: AprioriHashTree.scala
- Hash Tree implementation: HashTree.scala
Distributed implementations of Apriori and FP-Growth using Spark.
- YAFIM.scala
- YAFIM using a Hash Tree for filtering: YAFIMHashTree.scala
- RApriori.scala
- DFPS.scala
Distributed algorithms originally proposed by the following papers:

Experiments

The three distributed FIM algorithms were compared in terms of execution time and normalized speedup. All experiments were run on Amazon EMR using m5.xlarge (dual core) machines.

Execution time in seconds. Datasets replicated 9 times and increasing cluster size.

Normalized speedup factor, if the number of cores is increased to 10, the ideal speedup factor is 10 (meaning that the execution time has decreased by 10 times). Datasets replicated 9 times and increasing cluster size (X axis measured in cores).

Building and running

Install and set Java 8 to your path, e.g. sdk install java 8.0.252.hs-adpt and sdk use java 8.0.252.hs-adpt
Instsall sbt, e.g. sdk install sbt
Compile the project using sbt compile
Adjust the absolute path for the provided datasets in src/main/resources/defaultfim.properties
Run the experiments with sbt run and select the main class experiments.Runner

Sample output (execution times in seconds):

+-----------------+------------+-------+-------+-------+-------+------+
|           Class |    Dataset | Run 1 | Run 2 | Run 3 |  Mean |   SD |
+-----------------+------------+-------+-------+-------+-------+------+
| AprioriHashTree |   mushroom | 14.90 | 13.96 | 14.65 | 14.50 | 0.40 |
|        FPGrowth |   mushroom |  0.63 |  0.55 |  0.64 |  0.61 | 0.04 |
|   YAFIMHashTree |   mushroom |  7.26 |  6.64 |  7.92 |  7.27 | 0.52 |
|        RApriori |   mushroom |  6.71 |  6.79 |  7.45 |  6.98 | 0.33 |
|            DFPS |   mushroom |  2.49 |  2.31 |  2.27 |  2.36 | 0.10 |
|                 |            |       |       |       |       |      |
| AprioriHashTree | T10I4D100K |  3.68 |  2.39 |  2.42 |  2.83 | 0.60 |
|        FPGrowth | T10I4D100K |  3.26 |  3.27 |  3.06 |  3.20 | 0.10 |
|   YAFIMHashTree | T10I4D100K |  1.96 |  1.90 |  1.89 |  1.92 | 0.03 |
|        RApriori | T10I4D100K |  1.89 |  1.80 |  1.75 |  1.81 | 0.06 |
|            DFPS | T10I4D100K |  4.13 |  4.07 |  4.14 |  4.11 | 0.03 |
+-----------------+------------+-------+-------+-------+-------+------+

+------------+-----------------+----------+---------------+----------+------+
|            | AprioriHashTree | FPGrowth | YAFIMHashTree | RApriori | DFPS |
+------------+-----------------+----------+---------------+----------+------+
|   mushroom |           14.50 |     0.61 |          7.27 |     6.98 | 2.36 |
| T10I4D100K |            2.83 |     3.20 |          1.92 |     1.81 | 4.11 |
+------------+-----------------+----------+---------------+----------+------+

Additionally, algorithms can be run individually by instantiating its respective class. Example:

object Main {
  def main(args: Array[String]): Unit = {
    
    val dfps: SparkFIM = new DFPS()
    
    val fileName = "/projects/spark-scala/frequent-itemset-mining-spark/src/main/resources/datasets/mushroom.txt"
    val frequentItemsets = dfps.execute(fileName, separator = " ", minSupport = 0.65)
    
    Util.printItemsets(frequentItemsets)
    
  }
}

Would output:

Elapsed time: 2.06 seconds. Class: DFPS.
Found 39 itemsets
[1] {34}, {36}, {39}, {85}, {86}, {90}
[2] {34, 36}, {34, 39}, {34, 85}, {34, 86}, {34, 90}, {36, 85}, {36, 86}, {36, 90}, {39, 85}, {39, 86}, {85, 86}, {85, 90}, {86, 90}
[3] {34, 36, 85}, {34, 36, 86}, {34, 36, 90}, {34, 39, 85}, {34, 39, 86}, {34, 85, 86}, {34, 85, 90}, {34, 86, 90}, {36, 85, 86}, {36, 85, 90}, {36, 86, 90}, {39, 85, 86}, {85, 86, 90}
[4] {34, 36, 85, 86}, {34, 36, 85, 90}, {34, 36, 86, 90}, {34, 39, 85, 86}, {34, 85, 86, 90}, {36, 85, 86, 90}
[5] {34, 36, 85, 86, 90}

Name		Name	Last commit message	Last commit date
Latest commit History 66 Commits
project		project
src		src
.gitignore		.gitignore
Final Paper - pt_BR.pdf		Final Paper - pt_BR.pdf
README.md		README.md
build.sbt		build.sbt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

project

project

src

src

.gitignore

.gitignore

Final Paper - pt_BR.pdf

Final Paper - pt_BR.pdf

README.md

README.md

build.sbt

build.sbt

Repository files navigation

Frequent Itemset Mining in Scala and Spark

Algorithms

Experiments

Building and running

About

Releases

Packages

Languages

felipekunzler/frequent-itemset-mining-spark

Folders and files

Latest commit

History

Repository files navigation

Frequent Itemset Mining in Scala and Spark

Algorithms

Experiments

Building and running

About

Topics

Resources

Stars

Watchers

Forks

Languages