scala-spark-mandelbrot

An example application that calculates a Mandelbrot Set using Spark. The goal is to demonstrate the use of the Spark computing framework to perform calculations while using Kubernetes as the underlying cluster management platform. The solution is not intended to be optimal in terms of technology choices or performance.

Application structure

The calculation of the Mandelbrot Set is done in two steps:

  • Input data file generation
  • Iteration process and creation of the image

Input file generation

First, an input data file needs to be created containing the data points for the iteration process. Each data line combines two types of coordinates:

  • image space coordinates
  • data point coordinates - the c value in the iteration formula

The format of each line is shown below. In image space coordinates, the top-left corner is (0,0), and the bottom-right corner is defined by the -sx and -sy parameters (the horizontal and vertical image sizes respectively) - see the run-spark-k8s-mand-generate.sh batch file below.

The application currently uses the same number of iterations for every data point in the input data file. Nevertheless, storing the iteration count with each data point keeps the calculation logic simple, and in theory allows the count to be varied per point if ever needed.

The input data file is created by the mandelbrot.Main.generateInputData function.

image-x,image-y,position-x,position-y,number-of-iterations

Where:

  • image-x : X position of the data point in image space
  • image-y : Y position of the data point in image space
  • position-x : X position of the data point in data space
  • position-y : Y position of the data point in data space
  • number-of-iterations : number of iterations for the given data point
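The mapping from image space to data space is a linear interpolation between the corners given by the -tl and -br parameters of the generation script below. A minimal sketch of such a mapping (pixelToPoint is an illustrative helper, not the repository's actual code):

```scala
// Map a pixel (px, py) in image space to a point c = (cx, cy) in data space.
// The top-left pixel (0, 0) maps to the -tl corner (tlX, tlY); the
// bottom-right pixel (sizeX - 1, sizeY - 1) maps to the -br corner (brX, brY).
def pixelToPoint(px: Int, py: Int,
                 sizeX: Int, sizeY: Int,
                 tlX: Double, tlY: Double,
                 brX: Double, brY: Double): (Double, Double) = {
  val cx = tlX + px.toDouble / (sizeX - 1) * (brX - tlX)
  val cy = tlY + py.toDouble / (sizeY - 1) * (brY - tlY)
  (cx, cy)
}

// With the parameters used below (-tl -2.2,1.2 / -br 1.0,-1.2 / -sx 800 / -sy 600),
// pixel (0, 0) maps to the top-left corner and (799, 599) to the bottom-right.
```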

Batch file to generate the input data file: run-spark-k8s-mand-generate.sh

hdfs dfs -put -f scala-spark-mandelbrot-assembly-0.3.jar /scala-spark-mandelbrot-assembly-0.3.jar
hdfs dfs -rm -r -f /input-800-600

$SPARK_HOME/bin/spark-submit \
    --class mandelbrot.Main \
    --master k8s://172.31.36.93:6443 \
    --deploy-mode cluster \
    --executor-memory 1G \
    --total-executor-cores 3 \
    --name mandelbrot \
    --conf spark.kubernetes.container.image=krzsam/spark:spark-docker \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
    hdfs://ip-172-31-36-93:4444/scala-spark-mandelbrot-assembly-0.3.jar \
    -h hdfs://ip-172-31-36-93:4444/ \
    -c generate \
    -f hdfs://ip-172-31-36-93:4444/input-800-600 \
    -tl -2.2,1.2 \
    -br 1.0,-1.2 \
    -sx 800 \
    -sy 600 \
    -i 1024

Application parameters:

  • -h : specifies HDFS URI
  • -c : specifies the step; for input file generation the value should be generate
  • -f : specifies the HDFS location for the input data. Note that the input data "file" is in fact a directory: Spark partitions the data, so it is physically represented as a directory of files containing the partitioned data. The application currently writes the input data in Parquet format, but this could be changed to any other supported format.
  • -tl : specifies the top-left corner in input data coordinates, provided as a comma-separated pair of decimal numbers
  • -br : specifies the bottom-right corner in input data coordinates, provided as a comma-separated pair of decimal numbers
  • -sx : specifies the horizontal image size
  • -sy : specifies the vertical image size
  • -i : the number of iterations each data point will be iterated through
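The -tl and -br pairs could be parsed along these lines (an illustrative sketch; parsePair is a hypothetical helper, not necessarily the repository's actual option-parsing code):

```scala
// Parse a "x,y" argument (e.g. the value of -tl or -br) into a coordinate pair.
// Hypothetical helper for illustration; the application's real parsing may differ.
def parsePair(arg: String): (Double, Double) =
  arg.split(",") match {
    case Array(x, y) => (x.trim.toDouble, y.trim.toDouble)
    case _           => throw new IllegalArgumentException(s"Expected 'x,y', got: $arg")
  }
```

For example, parsePair("-2.2,1.2") yields the top-left corner used in the script above.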

Iteration process and creation of the image

Once the input data file is created on HDFS, the calculation step reads it line by line and iterates each data point defined by an input line using the Mandelbrot Set formula; it is run via the run-spark-k8s-mand-calculate.sh batch file. The actual calculation is done by the mandelbrot.Main.calculateImage function.
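The per-point iteration follows the standard escape-time formula z(n+1) = z(n)^2 + c. A minimal self-contained sketch of what that calculation might look like (the iterate function below is illustrative, not the repository's exact code):

```scala
// Escape-time iteration for one data point c = (cx, cy).
// Returns the number of iterations performed before |z| exceeds 2.0,
// or maxIter if the point never escapes (i.e. lies inside the set).
def iterate(cx: Double, cy: Double, maxIter: Int): Int = {
  var zx = 0.0
  var zy = 0.0
  var n = 0
  while (n < maxIter && zx * zx + zy * zy <= 4.0) {
    val tmp = zx * zx - zy * zy + cx // Re(z^2 + c)
    zy = 2.0 * zx * zy + cy          // Im(z^2 + c)
    zx = tmp
    n += 1
  }
  n
}

// iterate(0.0, 0.0, 1024) == 1024  (the origin is inside the set)
// iterate(2.0, 2.0, 1024) == 1     (this point escapes immediately)
```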

The size of the image to generate is derived from the input data using:

    val dim_x = calculated.agg( sql.functions.max( "img_x" ) ).head().getInt( 0 ) + 1
    val dim_y = calculated.agg( sql.functions.max( "img_y" ) ).head().getInt( 0 ) + 1

The above is not optimal from a performance perspective, as the input data needs to be re-scanned, but it is done this way as an example of using aggregation functions.

hdfs dfs -put -f scala-spark-mandelbrot-assembly-0.3.jar /scala-spark-mandelbrot-assembly-0.3.jar

$SPARK_HOME/bin/spark-submit \
    --class mandelbrot.Main \
    --master k8s://172.31.36.93:6443 \
    --deploy-mode cluster \
    --executor-memory 1G \
    --total-executor-cores 3 \
    --name mandelbrot \
    --conf spark.kubernetes.container.image=krzsam/spark:spark-docker \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
    hdfs://ip-172-31-36-93:4444/scala-spark-mandelbrot-assembly-0.3.jar \
    -h hdfs://ip-172-31-36-93:4444/ \
    -c calculate \
    -f hdfs://ip-172-31-36-93:4444/input-800-600

Application parameters:

  • -h : specifies HDFS URI
  • -c : specifies the step; for calculation and image generation the value should be calculate
  • -f : specifies HDFS location for the input data file as created in the generation step above.

The calculation step produces a single PNG file on HDFS, whose name corresponds to the input data file name:

input-800-600.<timestamp>.png
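One plausible way to render such a PNG from the calculated iteration counts is via java.awt.image.BufferedImage and javax.imageio.ImageIO. The sketch below (writeImage is a hypothetical helper using a simple greyscale colouring; the application's actual rendering may differ) assumes the calculated rows have been collected on the driver:

```scala
import java.awt.image.BufferedImage
import java.io.File
import javax.imageio.ImageIO

// Render per-pixel iteration counts into a PNG file.
// Each row is (imgX, imgY, iterations); points that reached maxIter
// (inside the set) come out white, fast-escaping points come out black.
def writeImage(rows: Seq[(Int, Int, Int)], dimX: Int, dimY: Int,
               maxIter: Int, file: File): Unit = {
  val image = new BufferedImage(dimX, dimY, BufferedImage.TYPE_INT_RGB)
  for ((x, y, n) <- rows) {
    val shade = 255 * n / maxIter                // greyscale level 0..255
    val rgb   = (shade << 16) | (shade << 8) | shade
    image.setRGB(x, y, rgb)
  }
  ImageIO.write(image, "png", file)
}
```

The resulting local file would then be copied onto HDFS under the timestamped name shown above.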

An example file produced by the calculation is shown below:

[Image: Example Mandelbrot Set]

Application build

The application is built in the same way, and uses the same Spark Docker image, as the scala-spark-example project.

For specific details on running Spark on Kubernetes, including creating the Spark Docker image, please refer to Running application on Spark via Kubernetes.

