SparkScope is a monitoring and profiling tool for Spark applications. It lets you review the resource allocation, utilization, and demand timeline as they unfolded during Spark application execution. SparkScope presents this information using visual charts which make it possible to
- find bottlenecks in application execution,
- reconcile resource demand and supply,
- fine tune Spark application for desired objectives.
SparkScope is implemented as a SparkListener, which means that it runs inside the driver and listens for Spark events. It consumes CSV metrics produced by the custom SparkScopeCsvSink and supports multiple metrics storage types.
SparkScope produces reports in the following formats:
- HTML
- JSON
SparkScope HTML reports contain the following features:
- Stats
- Charts
- Warnings
| | Spark 2 (spark2 branch) | Spark 3 (main branch) |
|---|---|---|
| Scala version | 2.11.12 | 2.12.18 |
| compatible JDK versions | 7, 8 | 8, 11, 17 |
| compatible Spark versions | 2.3, 2.4 | 3.2, 3.3, 3.4, 3.5 |
Supported metrics storage types:
- S3
- HDFS
- MaprFS
- NFS/local
Supported cluster managers:
- Hadoop YARN (Client and Cluster deploy modes)
- Spark Standalone cluster
parameter | type | sample values | description |
---|---|---|---|
spark.extraListeners | mandatory | com.ucesys.sparkscope.SparkScopeJobListener | Spark listener class |
spark.metrics.conf.driver.source.jvm.class | mandatory | org.apache.spark.metrics.source.JvmSource | jvm metrics source for driver |
spark.metrics.conf.executor.source.jvm.class | mandatory | org.apache.spark.metrics.source.JvmSource | jvm metrics source for executor |
spark.metrics.conf.*.sink.csv.class | mandatory | org.apache.spark.metrics.sink.SparkScopeCsvSink | csv sink class |
spark.metrics.conf.*.sink.csv.period | mandatory | 5 | period of metrics spill |
spark.metrics.conf.*.sink.csv.unit | mandatory | seconds | unit of period of metrics spill |
spark.metrics.conf.*.sink.csv.directory | mandatory | s3://my-bucket/path/to/metrics | path to the metrics directory; can be S3, HDFS, MaprFS, or local |
spark.metrics.conf.*.sink.csv.region | optional | us-east-1 | AWS region, required for S3 storage |
spark.metrics.conf.*.sink.csv.appName | optional | MyApp | application name, also used for grouping metrics |
spark.sparkscope.report.html.path | optional | s3://my-bucket/path/to/html/report/dir | path to which SparkScope html report will be saved |
spark.sparkscope.report.json.path | optional | s3://my-bucket/path/to/json/report/dir | path to which SparkScope json report will be saved |
spark.sparkscope.log.path | optional | s3://my-bucket/path/to/log/dir | path to which SparkScope logs will be saved |
spark.sparkscope.log.level | optional | DEBUG, INFO, WARN, ERROR | logging level for SparkScope logs |
spark.sparkscope.diagnostics.enabled | optional | true/false | set to false to disable submitting diagnostics, default=true. |
spark.sparkscope.metrics.dir.driver | optional | s3://my-bucket/path/to/metrics | path to driver csv metrics relative to driver, defaults to "spark.metrics.conf.driver.sink.csv.directory" property value |
spark.sparkscope.metrics.dir.executor | optional | s3://my-bucket/path/to/metrics | path to executor csv metrics relative to driver, defaults to "spark.metrics.conf.executor.sink.csv.directory" property value |
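The optional spark.sparkscope.* properties and the listener can also be set once in spark-defaults.conf instead of being passed with every spark-submit. A minimal sketch (the bucket name and paths below are placeholders, not defaults):
spark.extraListeners                   com.ucesys.sparkscope.SparkScopeJobListener
spark.sparkscope.report.html.path      s3://my-bucket/sparkscope/html
spark.sparkscope.report.json.path      s3://my-bucket/sparkscope/json
spark.sparkscope.log.level             INFO
spark.sparkscope.diagnostics.enabled   true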
Notes:
- One can choose to put all spark.metrics.conf properties in a metrics.properties file
- Using the custom sink (SparkScopeCsvSink) requires adding the jar to the driver and executors and extending their classpaths:
  - the --files (spark.files) option should be used to ship the jar to both the driver and executors
  - the --jars (spark.jars) option will only make the sink available to the driver
Example spark-submit command with metrics and report stored on S3:
spark-submit \
--master yarn \
--files ./sparkscope-spark3-0.1.9-SNAPSHOT.jar \
--driver-class-path ./sparkscope-spark3-0.1.9-SNAPSHOT.jar \
--conf spark.executor.extraClassPath=./sparkscope-spark3-0.1.9-SNAPSHOT.jar \
--conf spark.extraListeners=com.ucesys.sparkscope.SparkScopeJobListener \
--conf spark.metrics.conf.driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource \
--conf spark.metrics.conf.executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource \
--conf spark.metrics.conf.*.sink.csv.class=org.apache.spark.metrics.sink.SparkScopeCsvSink \
--conf spark.metrics.conf.*.sink.csv.period=5 \
--conf spark.metrics.conf.*.sink.csv.unit=seconds \
--conf spark.metrics.conf.*.sink.csv.directory=s3://<bucket-name>/<path-to-metrics-dir> \
--conf spark.metrics.conf.*.sink.csv.region=<region> \
--conf spark.metrics.conf.*.sink.csv.appName=My-App \
--conf spark.sparkscope.report.html.path=s3://<bucket-name>/<path-to-html-report-dir> \
--class org.apache.spark.examples.SparkPi \
./spark-examples_2.10-1.1.1.jar 5000
Example spark-submit command with metrics and report stored on HDFS:
spark-submit \
--master yarn \
--files ./sparkscope-spark3-0.1.9-SNAPSHOT.jar \
--driver-class-path ./sparkscope-spark3-0.1.9-SNAPSHOT.jar \
--conf spark.executor.extraClassPath=./sparkscope-spark3-0.1.9-SNAPSHOT.jar \
--conf spark.extraListeners=com.ucesys.sparkscope.SparkScopeJobListener \
--conf spark.metrics.conf.driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource \
--conf spark.metrics.conf.executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource \
--conf spark.metrics.conf.*.sink.csv.class=org.apache.spark.metrics.sink.SparkScopeCsvSink \
--conf spark.metrics.conf.*.sink.csv.period=5 \
--conf spark.metrics.conf.*.sink.csv.unit=seconds \
--conf spark.metrics.conf.*.sink.csv.directory=hdfs://<path-to-metrics-dir> \
--conf spark.metrics.conf.*.sink.csv.appName=My-App \
--conf spark.sparkscope.report.html.path=hdfs://<path-to-html-report-dir> \
--class org.apache.spark.examples.SparkPi \
./spark-examples_2.10-1.1.1.jar 5000
Example spark-submit command with metrics and report stored on a local or NFS filesystem:
spark-submit \
--master yarn \
--files ./sparkscope-spark3-0.1.9-SNAPSHOT.jar \
--driver-class-path ./sparkscope-spark3-0.1.9-SNAPSHOT.jar \
--conf spark.executor.extraClassPath=./sparkscope-spark3-0.1.9-SNAPSHOT.jar \
--conf spark.extraListeners=com.ucesys.sparkscope.SparkScopeJobListener \
--conf spark.metrics.conf.driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource \
--conf spark.metrics.conf.executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource \
--conf spark.metrics.conf.*.sink.csv.class=org.apache.spark.metrics.sink.SparkScopeCsvSink \
--conf spark.metrics.conf.*.sink.csv.period=5 \
--conf spark.metrics.conf.*.sink.csv.unit=seconds \
--conf spark.metrics.conf.*.sink.csv.directory=<path-to-metrics-dir> \
--conf spark.metrics.conf.*.sink.csv.appName=My-App \
--conf spark.sparkscope.report.html.path=<path-to-html-report-dir> \
--class org.apache.spark.examples.SparkPi \
./spark-examples_2.10-1.1.1.jar 5000
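After the application finishes, it is worth verifying that the CSV metrics and the HTML report were actually written to the configured locations, for example:
# local/NFS storage
ls <path-to-metrics-dir> <path-to-html-report-dir>
# HDFS storage
hdfs dfs -ls hdfs://<path-to-metrics-dir> hdfs://<path-to-html-report-dir>
# S3 storage (requires AWS CLI credentials for the bucket)
aws s3 ls s3://<bucket-name>/<path-to-metrics-dir>/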
Instead of specifying the spark.metrics.conf.* settings as separate properties, we can also specify them in a metrics.properties file:
# Enable CsvSink for all instances by class name
*.sink.csv.class=org.apache.spark.metrics.sink.SparkScopeCsvSink
# Polling period for the CsvSink
*.sink.csv.period=5
# Unit of the polling period for the CsvSink
*.sink.csv.unit=seconds
# Output directory for the CsvSink
*.sink.csv.directory=hdfs:///tmp/csv-metrics
# JVM SOURCE
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
The path to the metrics.properties file is then passed in the spark-submit command:
spark-submit \
--master yarn \
--files ./sparkscope-spark3-0.1.9-SNAPSHOT.jar,./metrics.properties \
--driver-class-path ./sparkscope-spark3-0.1.9-SNAPSHOT.jar \
--conf spark.executor.extraClassPath=./sparkscope-spark3-0.1.9-SNAPSHOT.jar \
--conf spark.extraListeners=com.ucesys.sparkscope.SparkScopeJobListener \
--conf spark.metrics.conf=./metrics.properties \
--conf spark.sparkscope.report.html.path=hdfs://<path-to-html-report-dir> \
--class org.apache.spark.examples.SparkPi \
./spark-examples_2.10-1.1.1.jar 5000
To analyze a completed application, it needs to have run with the event log and metrics configured (but not the listener):
spark-submit \
--master yarn \
--files ./sparkscope-spark3-0.1.9-SNAPSHOT.jar \
--driver-class-path ./sparkscope-spark3-0.1.9-SNAPSHOT.jar \
--conf spark.executor.extraClassPath=./sparkscope-spark3-0.1.9-SNAPSHOT.jar \
--conf spark.eventLog.enabled=true \
--conf spark.metrics.conf.driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource \
--conf spark.metrics.conf.executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource \
--conf spark.metrics.conf.*.sink.csv.class=org.apache.spark.metrics.sink.SparkScopeCsvSink \
--conf spark.metrics.conf.*.sink.csv.period=5 \
--conf spark.metrics.conf.*.sink.csv.unit=seconds \
--conf spark.metrics.conf.*.sink.csv.directory=<path-to-metrics-dir> \
--conf spark.metrics.conf.*.sink.csv.appName=My-App \
--class org.apache.spark.examples.SparkPi \
./spark-examples_2.10-1.1.1.jar 5000
Running SparkScope as a standalone Java application:
java \
-cp ./sparkscope-spark3-0.1.9-SNAPSHOT.jar:$(hadoop classpath) \
com.ucesys.sparkscope.SparkScopeApp \
--event-log <path-to-event-log> \
--html-path <path-to-html-report-dir> \
--json-path <path-to-json-report-dir> \
--log-path <path-to-log-dir> \
--log-level <logging level> \
--diagnostics <true/false> \
--region <aws-region>
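For example, to analyze a finished application using its event log on HDFS and write the HTML report to a local directory (the event log and output paths below are illustrative):
java \
-cp ./sparkscope-spark3-0.1.9-SNAPSHOT.jar:$(hadoop classpath) \
com.ucesys.sparkscope.SparkScopeApp \
--event-log hdfs:///spark-history/app-20230928132004-0012 \
--html-path /tmp/sparkscope \
--log-level INFO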
The SparkScope analysis summary should be printed to the console:
28/09/2023 01:20:22 INFO [SparkScope] SparkScope analysis took 0.052s
28/09/2023 01:20:22 INFO [SparkScope]
____ __ ____
/ __/__ ___ _____/ /__ / __/_ ___ ___ ___
_\ \/ _ \/ _ `/ __/ '_/_\ \/_ / _ \/ _ \/__/
/___/ .__/\_,_/_/ /_/\_\/___/\__\_,_/ .__/\___/
/_/ /_/ v0.1.1
28/09/2023 01:20:22 INFO [SparkScope] Executor stats:
Executor heap size: 800MB
Max heap memory utilization by executor: 286MB(35.80%)
Average heap memory utilization by executor: 156MB(19.56%)
Max non-heap memory utilization by executor: 49MB
Average non-heap memory utilization by executor: 35MB
28/09/2023 01:20:22 INFO [SparkScope] Driver stats:
Driver heap size: 910MB
Max heap memory utilization by driver: 262MB(28.87%)
Average heap memory utilization by driver: 207MB(22.78%)
Max non-heap memory utilization by driver: 67MB
Average non-heap memory utilization by driver: 65MB
28/09/2023 01:20:22 INFO [SparkScope] Cluster Memory stats:
Average Cluster heap memory utilization: 19.56% / 156MB
Max Cluster heap memory utilization: 35.80% / 286MB
heapGbHoursAllocated: 0.0033
heapGbHoursAllocated=(executorHeapSizeInGb(0.78125)*combinedExecutorUptimeInSec(15s))/3600
heapGbHoursWasted: 0.0006
heapGbHoursWasted=heapGbHoursAllocated(0.0033)*heapUtilization(0.1956)
28/09/2023 01:20:22 INFO [SparkScope] Cluster CPU stats:
Total CPU utilization: 68.35%
coreHoursAllocated: 0.0042
coreHoursAllocated=(executorCores(1)*combinedExecutorUptimeInSec(15s))/3600
coreHoursWasted: 0.0029
coreHoursWasted=coreHoursAllocated(0.0042)*cpuUtilization(0.6835)
28/09/2023 01:20:22 INFO [SparkScope] Wrote HTML report file to /tmp/app-20230928132004-0012.html
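To make the formulas in the summary concrete, plugging in the sample values from the run above:
heapGbHoursAllocated = (0.78125GB * 15s) / 3600 ≈ 0.0033
heapGbHoursWasted = 0.0033 * 0.1956 ≈ 0.0006
coreHoursAllocated = (1 core * 15s) / 3600 ≈ 0.0042
coreHoursWasted = 0.0042 * 0.6835 ≈ 0.0029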